2.5: The Needleman-Wunsch Algorithm - Biology

We will now use dynamic programming to tackle the harder problem of general sequence alignment. The algorithm we will develop in the following sections to solve sequence alignment is known as the Needleman-Wunsch algorithm.

Dynamic programming vs. memoization

Before we dive into the algorithm, a final note on memoization is in order. Much like the Fibonacci problem, the sequence alignment problem can be solved in either a top-down or bottom-up approach.

In a top-down recursive approach we can use memoization to create a potentially large dictionary indexed by each of the subproblems that we are solving (aligned sequences). This requires (Oleft(n^{2} m^{2} ight)) space if we index each subproblem by the starting and end points of the subsequences for which an optimal alignment needs to be computed. The advantage is that we solve each subproblem at most once: if it is not in the dictionary, the problem gets computed and then inserted into dictionary for further reference.

In a bottom-up iterative approach we can use dynamic programming. We define the order of computing sub-problems in such a way that a solution to a problem is computed once the relevant sub-problems have been solved. In particular, simpler sub-problems will come before more complex ones. This removes the need for keeping track of which sub-problems have been solved (the dictionary in memoization turns into a matrix) and ensures that there is no duplicated work (each sub-alignment is computed only once).

Thus in this particular case, the only practical difference between memoization and dynamic programming is the cost of recursive calls incurred in the memoization case (space usage is the same).

Problem Statement

Suppose we have an optimal alignment for two sequences (S_{1 ldots n}) and (T_{1 ldots m}) in which Si matches Tj. The key insight is that this optimal alignment is composed of an optimal alignment between (left(S_{1}, ldots, S_{i-1} ight)) and (left(T_{1}, ldots, T_{i-1} ight)) and an optimal alignment between (left(S_{i+1}, ldots, S_{n} ight)) and (left(T_{j+1}, ldots, T_{m} ight)). This follows from a cut-and-paste argument: if one of these partial alignments is suboptimal, then we cut-and-paste a better alignment in place of the suboptimal one. This achieves a higher score of the overall alignment and thus contradicts the optimality of the initial global alignment. In other words, every subpath in an optimal path must also be optimal. Notice that the scores are additive, so the score of the overall alignment equals the addition of the scores of the alignments of the subsequences. This implicitly assumes that the sub-problems of computing the optimal scoring alignments of the subsequences are independent. We need to biologically motivate that such an assumption leads to meaningful results.

Index space of subproblems

We now need to index the space of subproblems. Let (F_{i,j}) be the score of the optimal alignment of (left(S_{1}, ldots, S_{i} ight)) and (left(T_{1}, ldots, T_{i} ight)). The space of subproblems is (left{F_{i, j}, i in[0,|S|], j in[0,|T|] ight}). This allows us to maintain an ( (m + 1) × (n + 1)) matrix F with the solutions (i.e. optimal scores) for all the subproblems.

Local optimality

We can compute the optimal solution for a subproblem by making a locally optimal choice based on the results from the smaller sub-problems. Thus, we need to establish a recursive function that shows how the solution to a given problem depends on its subproblems. And we use this recursive definition to fill up the table F in a bottom-up fashion.

We can consider the 4 possibilities (insert, delete, substitute, match) and evaluate each of them based on the results we have computed for smaller subproblems. To initialize the table, we set (F_{0,j} = −j · d) and (F_{i,0} = −i·d ) since those are the scores of aligning (left(T_{1}, ldots, T_{i} ight)) with j gaps and (left(S_{1}, ldots, S_{i} ight)) with (i) gaps (aka zero overlap between the two sequences). Then we traverse the matrix column by column computing the optimal score for each alignment subproblem by considering the four possibilities:

• Sequence S has a gap at the current alignment position.

• Sequence T has a gap at the current alignment position.

• There is a mutation (nucleotide substitution) at the current position.

• There is a match at the current position.

We then use the possibility that produces the maximum score. We express this mathematically by the recursive formula for (F_{i, j}):

[F(0,0)=0 onumber]

[Initialization : F (i, 0) = F (i − 1, 0) − d onumber ]

[F (0, j) = F (0, j − 1) − d onumber ]

[ ext {Iteration}: quad F(i, j)=max left{egin{array}{l}
F(i-1, j)-d ext { insert gap in S}
F(i, j-1)-d ext{ insert gap in T}
F(i-1, j-1)+sleft(x_{i}, y_{j} ight) ext{ match or mutation}
end{array} ight. onumber ]

Termination : Bottom right

After traversing the matrix, the optimal score for the global alignment is given by (F_{m,n}). The traversal order needs to be such that we have solutions to given subproblems when we need them. Namely, to compute (F_{i,j}), we need to know the values to the left, up, and diagonally above (F_{i,j}) in the table. Thus we can traverse the table in row or column major order or even diagonally from the top left cell to the bottom right cell. Now, to obtain the actual alignment we just have to remember the choices that we made at each step.

Optimal Solution

Paths through the matrix (F) correspond to optimal sequence alignments. In evaluating each cell (F_{i,j}) we make a choice by selecting the maximum of the three possibilities. Thus the value of each (uninitialized) cell in the matrix is determined either by the cell to its left, above it, or diagonally to the left above it. A match and a substitution are both represented as traveling in the diagonal direction; however, a different cost can be applied for each, depending on whether the two base pairs we are aligning match or not. To construct the actual optimal alignment, we need to traceback through our choices in the matrix. It is helpful to maintain a pointer for each cell while filling up the table that shows which choice was made to get the score for that cell. Then we can just follow our pointers backwards to reconstruct the optimal alignment.

Solution Analysis

The runtime analysis of this algorithm is very simple. Each update takes (O(1)) time, and since there are mn elements in the matrix F, the total running time is (O(mn)). Similarly, the total storage space is O(mn). For the more general case where the update rule is more complicated, the running time may be more expensive. For instance, if the update rule requires testing all sizes of gaps (e.g. the cost of a gap is not linear), then the running time would be (O(mn(m + n))).

Needleman-Wunsch in practice

Assume we want to align two sequences S and T, where

S = AGT

T = AAGC

The first step is placing the two sequences along the margins of a matrix and initializing the matrix cells. To initialize we assign a 0 to the first entry in the matrix and then fill in the first row and column based on the incremental addition of gap penalties, as in Figure 2.9 below. Although the algorithm could fill in the first row and column through iteration, it is important to clearly define and set boundaries on the problem.

The next step is iteration through the matrix. The algorithm proceeds either along rows or along columns, considering one cell at time. For each cell three scores are calculated, depending on the scores of three adjacent matrix cells (specifically the entry above, the one diagonally up and to the left, and the one to the left). The maximum score of these three possible tracebacks is assigned to the entry and the corresponding pointer is also stored. Termination occurs when the algorithm reaches the bottom right corner. In Figure 2.10 the alignment matrix for sequences S and T has been filled in with scores and pointers.  The final step of the algorithm is optimal path traceback. In our example we start at the bottom right corner and follow the available pointers to the top left corner. By recording the alignment decisions made at each cell during traceback, we can reconstruct the optimal sequence alignment from end to beginning and then invert it. Note that in this particular case, multiple optimal pathways exist (Figure 2.11). A pseudocode implementation of the Needleman-Wunsch algorithm is included in Appendix 2.11.4

Optimizations

The dynamic algorithm we presented is much faster than the brute-force strategy of enumerating alignments and it performs well for sequences up to 10 kilo-bases long. Nevertheless, at the scale of whole genome alignments the algorithm given is not feasible. In order to align much larger sequences we can make modifications to the algorithm and further improve its performance.

Bounded Dynamic Programming

One possible optimization is to ignore Mildly Boring Alignments (MBAs), or alignments that have too many gaps. Explicitly, we can limit ourselves to stay within some distance W from the diagonal in the matrix F of subproblems. That is, we assume that the optimizing path in F from (F_{0,0} ) to ( F_{m,n} ) is within distance W along the diagonal. This means that recursion (2.2) only needs to be applied to the entries in F within distance W around the diagonal, and this yields a time/space cost of (O((m + n)W )) (refer to Figure 2.12). Note, however, that this strategy is heuristic and no longer guarantees an optimal alignment. Instead it attains a lower bound on the optimal score. This can be used in a subsequent step where we discard the recursions in matrix F which, given the lower bound, cannot lead to an optimal alignment.

Linear Space Alignment

Recursion (2.2) can be solved using only linear space: we update the columns in F from left to right during which we only keep track of the last updated column which costs (O(m)) space. However, besides the score (F_{m,n}) of the optimal alignment, we also want to compute a corresponding alignment. If we use trace back, then we need to store pointers for each of the entries in F, and this costs (O(mn)) space. It is also possible to find an optimal alignment using only linear space! The goal is to use divide and conquer in order to compute the structure of the optimal alignment for one matrix entry in each step. Figure 2.13 illustrates the process. The key idea is that a dynamic programming alignment can proceed just as easily in the reverse direction, starting at the bottom right corner and terminating at the top left. So if the matrix is divided in half, then both a forward pass and a reverse pass can run at the same time and converge in the middle column. At the crossing point we can add the two alignment scores together; the cell in the middle column with the maximum score must fall in the overall optimal path.

We can describe this process more formally and quantitatively. First compute the row index (u ∈ {1,...,m}) that is on the optimal path while crossing the ( frac{n}{2} th) column. For (1 ≤ i ≤ m ) and (n ≤ j ≤ n) let (C_{i,j}) denote the row index that is on the optimal path to (F_{i,j}) while crossing the( frac{n}{2} th) column. Then, while we update the columns of F from left to right, we can also update the columns of C from left to right. So, in (O(mn)) time and (O(m) )space we are able to compute the score (F_{m,n}) and also (C_{m,n}), which is equal to the row index ( u ∈ {1, . , m}) that is on the optimal path while crossing the ( frac{n}{2} th) column. Now the idea of divide and conquer kicks in. We repeat the above procedure for the upper left (u imes frac{n}{2} ) submatrix of F and also repeat the above procedure for the lower right ( (m-u) imes frac{n}{2} ) submatrix of F . This can be done using ( O(m + n) ) allocated linear space. The running time for the upper left submatrix is ( Oleft(frac{u n}{2} ight) ) and the running time for the lower right submatrix is (Oleft(frac{(m-u) n}{2} ight)), which added together gives a running time of ( Oleft(frac{m n}{2} ight)=O(m n)).

We keep on repeating the above procedure for smaller and smaller submatrices of F while we gather more and more entries of an alignment with optimal score. The total running time is ( O(m n)+Oleft(frac{m n}{2} ight)+Oleft(frac{m n}{4} ight)+ldots=O(2mn) = O(mn)). So, without sacrificing the overall running time (up to a constant factor), divide and conquer leads to a linear space solution (see also Section ?? on Lecture 3).

Details of Needleman–Wunsch algorithm

I try to understand details of the Needleman–Wunsch algorithm and use an example from the book [Nello Cristianini, Matthew W. Hahn. Introduction to Computational Genomics. A Case Studies Approach, www.computational-genomics.net].

To my opinion, the algorithm is described not very neatly, so the question arises. Also, the calculation of the score for diagonal neighbor is also clear, because this is an alignment without a gap:

Unclear moment is the alignment for left (or above) neighbor. Say, the left neighbor  3.1 Alignment Algorithms and Dynamic Programming

One of the first attempts to align two sequences was carried out by Vladimir Levenstein in 1965, called “edit distance”, and now is often called Levenshtein Distance. The edit distance is defined as the number of single character edits necessary” to change one word to another. Initially, he described written texts and words, but this method was later applied to biological sequences. One of the most commonly used algorithms for computing the edit distance is the Wagner-Fischer algorithm, a Dynamic Programming algorithm.

Dynamic Programming optimally phrases the full problem as the optimal solution to the smaller pieces (sub-problems). The overall problem can then be expressed as a composition of the sub-problems. In addition to the Wagner-Fischer algorithm, numerous other dynamic programming algorithms have been developed for aligning biological sequences including the Needleman-Wunsch  and Smith-Waterman Algorithms .

Sequence Alignment Algorithms

Dynamic programming algorithms are recursive algorithms modified to store intermediate results, which improves efficiency for certain problems. The Smith-Waterman (Needleman-Wunsch) algorithm uses a dynamic programming algorithm to find the optimal local (global) alignment of two sequences -- and . The alignment algorithm is based on finding the elements of a matrix where the element is the optimal score for aligning the sequence ( , . ) with ( , . ). Two similar amino acids (e.g. arginine and lysine) receive a high score, two dissimilar amino acids (e.g. arginine and glycine) receive a low score. The higher the score of a path through the matrix, the better the alignment. The matrix is found by progressively finding the matrix elements, starting at and proceeding in the directions of increasing and . Each element is set according to:

where is the similarity score of comparing amino acid to amino acid (obtained here from the BLOSUM40 similarity table) and is the penalty for a single gap. The matrix is initialized with . When obtaining the local Smith-Waterman alignment, is modified:

The gap penalty can be modified, for instance, can be replaced by , where is the penalty for a single gap and is the number of consecutive gaps.

Once the optimal alignment score is found, the ``traceback'' through along the optimal path is found, which corresponds to the the optimal sequence alignment for the score. In the next set of exercises you will manually implement the Needleman-Wunsch alignment for a pair of short sequences, then perform global sequence alignments with a computer program developed by Anurag Sethi, which is based on the Needleman-Wunsch algorithm with an affine gap penalty, , where is the extension gap penalty. The output file will be in the GCG format, one of the two standard formats in bioinformatics for storing sequence information (the other standard format is FASTA).

Manually perform a Needleman-Wunsch alignment

In the first exercise you will test the Smith-Waterman algorithm on a short sequence parts of hemoglobin (PDB code 1AOW ) and myoglobin 1 (PDB code 1AZI ).

Here you will align the sequence HGSAQVKGHG to the sequence KTEAEMKASEDLKKHGT .

 H G S A Q V K G H G 0 -8 -16 -24 -32 -40 -48 -56 -64 -72 -80 K -8 T -16 E -24 A -32 E -40 M -48 K -56 A -64 S -72 E -80 D -88 L -96 K -104 K -112 H -120 G -128 T -136

The first step is to fill in the similarity scores S from looking up the matches in the BLOSUM40 table, shown here labeled with 1-letter amino acid codes:

We fill in the BLOSUM40 similarity scores for you in Table 4.

, using the convention that values appear in the top part of a square in large print, and values appear in the bottom part of a square in small print. Our gap penalty is 8.

Note: we consider to be the ``predecessor'' of , since it helped decided 's value. Later, predecessors will qualify to be on the traceback path.

Table 4: Alignment score worksheet. In all alignment boxes, the similarity score from the BLOSUM40 matrix lookup is supplied (small text, bottom of square). Four alignment scores are provided as examples (large text, top of square), try and calculate at least four more, following the direction provided in the text for calculating .
 H G S A Q V K G H G 0 -8 -16 -24 -32 -40 -48 -56 -64 -72 -80 K -8 T -16 E -24 A -32 E -40 M -48 K -56 A -64 S -72 E -80 D -88 L -96 K -104 K -112 H -120 G -128 T -136
Table 5: Traceback worksheet. The completed alignment score matrix (large text, top of each square) with the BLOSUM40 lookup scores S (small text, bottom of each square). To find the alignment, trace back starting from the lower right (T vs G, score -21) and proceed diagonally (to the left and up), left, or up. Only proceed, however, if the square in that direction could have been a predecessor, according to the conditions described in the text.
 H G S A Q V K G H G 0 -8 -16 -24 -32 -40 -48 -56 -64 -72 -80 K -8 T -16 E -24 A -32 E -40 M -48 K -56 A -64 S -72 E -80 D -88 L -96 K -104 K -112 H -120 G -128 T -136

Now to check your results against a computer program. We have prepared a pairwise Needleman-Wunsch alignment program, pair , which you will apply to the same sequences which you have just manually aligned.

/tbss.work/Bioinformatics/pairData
then start the pair alignment executable by typing:
pair targlist
. All alignments will be carried out using the BLOSUM40 matrix, with a gap penalty of 8. The paths to the input files and the BLOSUM40 matrix used are defined in the file targlist the BLOSUM40 matrix is the first 25 lines of the file blosum40 . (Other substitution matrices can be found at the NCBI/Blast website.)

Note: In some installations, the pair executable is
in

/tbss.work/Bioinformatics/pairData and here you must
type ./pair targlist to run it.
If you cannot access the pair executable at all, you can see the output from this step in

Finding homologous pairs of ClassII tRNA synthetases

Homologous proteins are proteins derived from a common ancestral gene. In this exercise with the Needleman-Wunsch algorithm you will study the sequence identity of several class II tRNA synthetases, which are either from Eucarya, Eubacteria or Archaea or differ in the kind of aminoacylation reaction which they catalyze. Table 6 summarizes the reaction type, the organism and the PDB accession code and chain name of the employed Class II tRNA synthetase domains.

 Specificity Organism PDB code:chain ASTRAL catalytic domain Aspartyl Eubacteria 1EQR:B d1eqrb3 Aspartyl Archaea 1B8A:A d1b8aa2 Aspartyl Eukarya 1ASZ:A d1asza2 Glycl Archaea 1ATI:A d1atia2 Histidyl Eubacteria 1ADJ:C d1adjc2 Lysl Eubacteria 1BBW:A d1bbwa2 Aspartyl Eubacteria 1EFW:A d1efwa3

We have have prepared a computer program multiple which will align multiple pairs of proteins.

Indel is a term used for both the insertion and deletion operations in biological sequences.

Alsmirat MA, Jararweh Y, Al-Ayyoub M, Shehab MA, Gupta BB (2017) Accelerating compute intensive medical imaging segmentation algorithms using hybrid cpu-gpu implementations. Multimed Tools Appl 76(3):3537–3555

Balhaf K, Alsmirat MA, Al-Ayyoub M, Jararweh Y, Shehab MA (2017) Accelerating levenshtein and damerau edit distance algorithms using gpu with unified memory. In: 2017 8th international conference on information and communication systems (ICICS). IEEE, pp 7–11

Balhaf K, Shehab MA, Wala’a T, Al-Ayyoub M, Al-Saleh M, Jararweh Y (2016) Using gpus to speed-up levenshtein edit distance computation. In: 2016 7th international conference on information and communication systems (ICICS). IEEE, pp 80–84

Chan S, Wong A, Chiu D (1992) A survey of multiple sequence comparison methods. Bull Math Biol:563–598

Cook S (2012) CUDA programming: a developer’s guide to parallel computing with GPUs. Newnes, p 16–25

De Oliveira Sandes E, Miranda G, Martorell X, Ayguade E, Teodoro G, Magalhaes Alves de Melo AC (2016) Cudalign 4.0: incremental speculative traceback for exact chromosome-wide alignment in GPU clusters. IEEE Trans Parallel Distrib Syst 27(10):2838–2850

Durbin R, Eddy S, Krogh A, Mitchison G (1998) Biological sequence analysis. Cambridge University Press, Cambridge

El-Metwally S, Ouda O, Helmy M (2014) Next generation sequencing technologies and challenges in sequence assembly. Springer Sci Bus 7:16–25

Fakirah M, Shehab MA, Jararweh Y, Al-Ayyoub M (2015) Accelerating needleman-wunsch global alignment algorithm with gpus. In: 2015 IEEE/ACS 12th international conference of computer systems and applications (AICCSA). IEEE, pp 1–5

Farrar M (2007) Striped smith-waterman speeds database searches six times over other simd implementations. Bioinformatics 23(2):156–161

Gebali F (2011) Algorithms and parallel computing, vol 84. Wiley, New York

Gotoh O (1982) An improved algorithm for matching biological sequences. J Mol Biol 162(3):705–708

Jones C, Pevzner P (2004) An introduction to bioinformatics algorithms. MIT press, Cambridge

Katoh K, Toh H (2008) Recent developments in the mafft multiple sequence alignment program. Brief Bioinform 92:86–98

Liu Y, Schmidt B (2015) Gswabe: faster gpu-accelerated sequence alignment with optimal alignment retrieval for short dna sequences. Concurr Comput: Pract Exper 27 (4):958–972

Lomont C (2011) Introduction to intel advanced vector extensions. White paper, Intel

Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48(3):443–453

Rognes T (2011) Faster Smith-Waterman database searches with inter-sequence SIMD parallelisation. BMC Bioinform 12:221. https://doi.org/10.1186/1471-2105-12-221

Rognes T, Seeberg E (2000) Six-fold speed-up of smith-waterman sequence database searches using parallel processing on common microprocessors. Bioinformatics 16(8):699–706

Sanders J, Kandrot E (2010) CUDA by example: an introduction to general-purpose GPU programming. Addison-Wesley Professional, Boston

Serrano JP, De Oliveira Sandes E, Magalhaes Alves de Melo A, Ujaldon M (2017) Smith-waterman acceleration in multi-gpus: a performance per watt analysis. In: Bioinformatics and biomedical engineering - 5th international work-conference, IWBBIO 2017, Granada, Spain, April 26–28, 2017, Proceedings, Part II, pp 512–523

Setubal J, Meidanis J (1997) Introduction to computational molecular biology. PWS Pub, Boston

Shehab MA, Al-Ayyoub M, Jararweh Y (2015) Improving fcm and t2fcm algorithms performance using gpus for medical images segmentation. In: 2015 6th international conference on information and communication systems (ICICS), pp 130–135

Siriwardena P, Ranasinghe N (2010) Accelerating global sequence alignment using cuda compatible multi-core gpu. In: 5th international conference in information and automation for sustainability (ICIAFS), pp 201–206

Vermij P (2011) Genetic sequence alignment on a supercomputing platform. Doctoral dissertation, TU Delft, Delft University of Technology

Wozniak A (1997) Using video-oriented instructions to speed up sequence comparison. Comput Appl Biosci: CABIOS 13(2):145–150

Zhou W, Zhanxiu C, Lian B, Wang J, Jianping M (2017) Protein database search of hybrid alignment algorithm based on gpu parallel acceleration. J Supercomput. https://doi.org/10.1007/s11227-017-2030-x

Zhu X, Li K, Salah A, Shi L, Li K (2015) Parallel implementation of MAFFT on cuda-enabled graphics hardware. IEEE/ACM Trans Comput Biol Bioinform 12(1):205–218. https://doi.org/10.1109/TCBB.2014.2351801

Conclusion

An efficient pairwise RNA sequence alignment heuristic, which intrinsically considers suboptimal base pairings, accurately discriminates between distinct structured RNA families. When combined with a noise-tolerant density-based clustering algorithm, this approach identifies known and novel RNA structure motifs from a set of input sequences without a priori knowledge. The resulting RNA structure motifs are subsequently used to identify homologues in the human genome, improving the annotation of lncRNAs and increasing the repertoire of functional genetic elements.

Sequence Alignment and importance:

Sequence Alignmentor sequence comparison lies at heart of the bioinformatics, which describes the way of arrangement of DNA/RNA or protein sequences, in order to identify the regions of similarity among them. It is used to infer structural, functional and evolutionary relationship between the sequences. Alignment finds similarity level between query sequence and different database sequences. The algorithm works by dynamic programming approach which divides the problem into smaller independent sub problems. It finds the alignment more quantitatively by assigning scores.

When a new sequence is found, the structure and function can be easily predicted by doing sequence alignment. Since it is believed that, a sequence sharing common ancestor would exhibit similar structure or function. Greater the sequence similarity, greater is the chance that they share similar structure or function.

Methods of Sequence Alignment:

There are mainly two methods of Sequence Alignment:

Global Alignment : Closely related sequences which are of same length are very much appropriate for global alignment. Here, the alignment is carried out from beginning till end of the sequence to find out the best possible alignment.

The Needleman-Wunsch algorithm (A formula or set of steps to solve a problem) was developed by Saul B. Needleman and Christian D. Wunsch in 1970, which is a dynamic programming algorithm for sequence alignment. The dynamic programming solves the original problem by dividing the problem into smaller independent sub problems. These techniques are used in many different aspects of computer science. The algorithm explains global sequence alignment for aligning nucleotide or protein sequences.

Local Alignment : Sequences which are suspected to have similarity or even dissimilar sequences can be compared with local alignment method. It finds local regions with high level of similarity.

These two methods of alignments are defined by different algorithms, which use scoring matrices to align the two different series of characters or patterns (sequences). The two different alignment methods are mostly defined by Dynamic programming approach for aligning two different sequences.

Dynamic Programming:

Dynamic programming is used for optimal alignment of two sequences. It finds the alignment in a more quantitative way by giving some scores for matches and mismatches (Scoring matrices), rather than only applying dots. By searching the highest scores in the matrix, alignment can be accurately obtained. The Dynamic Programming solves the original problem by dividing the problem into smaller independent sub problems. These techniques are used in many different aspects of computer science. Needleman-Wunsch and Smith-Waterman algorithms for sequence alignment are defined by dynamic programming approach.

Scoring matrices:

In optimal alignment procedures, mostly Needleman-Wunsch and Smith-Waterman algorithms use scoring system. For nucleotide sequence alignment, the scoring matrices used are relatively simpler since the frequency of mutation for all the bases are equal. Positive or higher value is assigned for a match and a negative or a lower value is assigned for mismatch. These assumption based scores can be used for scoring the matrices. There are other scoring matrices which are predefined mostly, used in the case of amino acid substitutions.

Mainly used predefined matrices are PAM and BLOSUM.

PAM Matrices: Margaret Dayhoff was the first one to develop the PAM matrix, PAM stands for Point Accepted Mutations. PAM matrices are calculated by observing the differences in closely related proteins. One PAM unit (PAM1) specifies one accepted point mutation per 100 amino acid residues, i.e. 1% change and 99% remains as such.

BLOSUM: BLOcks SUbstitution Matrix, developed by Henikoff and Henikoff in 1992, used conserved regions. These matrices are actual percentage identity values. Simply to say, they depend on similarity. Blosum 62 means there is 62 % similarity.

Gap score or gap penalty: Dynamic programming algorithms use gap penalties to maximize the biological meaning. Gap penalty is subtracted for each gap that has been introduced. There are different gap penalties such as gap open and gap extension. The gap score defines a penalty given to alignment when we have insertion or deletion. During the evolution, there may be a case where we can see continuous gaps all along the sequence, so the linear gap penalty would not be appropriate for the alignment. Thus gap open and gap extension has been introduced when there are continuous gaps (five or more). The open penalty is always applied at the start of the gap, and then the other gaps following it is given with a gap extension penalty which will be less compared to the open penalty. Typical values are –12 for gap opening, and –4 for gap extension.

Introduction

Today, the Internet of Things (IoT) network concept has become an integral part of the modern information infrastructure of society. The IoT is a single network that connects the smart objects of the real world and virtual objects around us. The number and variability of such objects (smart devices) continues to grow, affecting various aspects of human life, from vehicles to mobile electronics and healthcare . In other words, the IoT is not just a set of different devices and sensors connected by wired and wireless communications and the global cyber space of the Internet, but it is a closer integration of the real and virtual worlds, in which communication takes place between people and devices.

The sustainability of IoT is the availability of the system to perform its functions in the context of external influence . Considering that the subsystems and components of the IoT have many horizontal connections, thus forming a heterogeneous cyber infrastructure with a large number of entities, most of which lack built-in protection mechanisms the problem of network security in such networks is especially actual. Main attack vectors are aimed for overtaking the control by the IoT device and capturing the confidential data, as well as the purposeful node deactivation . Not only objects of the virtual world�ta, files, personal information𠅊re under threat, but also physical objects of the real world. According to a report by Kaspersky Laboratory, in the first half of 2019, more than 100 million attacks were detected on the smart IoT devices, which are seven times more than in the <"type":"entrez-nucleotide","attrs":<"text":"H12018","term_id":"876838","term_text":"H12018">> H12018 . Network infrastructure and wireless data transmission channels are the most vulnerable objects of the modern infocommunication environment . Attackers using wireless links can remotely invade a target subnet or device (group of devices), intercept traffic, launch denial of service attacks (including distributed attacks), and take control of the IoT devices to create megabotnets.

In 2020, the world is facing with the COVID-2019 coronavirus pandemic, and life has almost completely moved to the network, people around the world work, study, shop and have fun online. This could not affect the goals of the cybercriminals, medical organizations, educational services, delivery, etc., faced massive DDoS attacks. A large network of hospitals in France fell victim to one of these attacks, when attackers tried to disable the infrastructure of medical institutions. As a result of the attack, hospital staff working remotely was unable to use work programs and corporate e-mail for some time. In Germany, on the first day of distance learning, a distance learning platform was attacked. The service through which teachers exchange teaching materials, homework, and tests with students was not available. In the Netherlands, the food delivery service Thuisbezorgd was unable to process orders as a result of a DDoS attack and had to refund customers. In the first quarter of 2020, there was a significant increase in the number and quality of DDoS attacks. In particular, the popularity of smart attacks that exploit vulnerabilities in the network infrastructure is noted. SYN flood is the leader among the types of attacks (92.6%) the share of attacks via ICMP continued to grow and exceeded the share of UDP and TCP floods (Kaspersky .

To counter cyber attacks at the IoT, as in traditional networks, various intrusion detection systems (IDS) are applied (e.g., . There are two main approaches on which most the IoT IDSs are based: anomaly detection and misuse detection. The advantages of the former are a global view on the network, the ability to detect new attacks, and the use of fewer rules (compared to signature methods) their disadvantages include a high level of errors of the first and second kind and low adaptation for changing conditions (new profiles, dynamic anomalies). The advantages of an IDS based on search for abuse are reliability, speed, low false positives for known intrusions. Its disadvantages are the strict dependence on the relevance of the signature base and the impossibility of detecting new attacks.

The work of an IDS usually consists of three stages: (1) monitoring, (2) sequence extraction and misuse detection by pattern matching, (3) signaling of attack to provide the attack response. Matching with attack patterns is currently the main method for IDS when detecting malicious activity  however, it has two significant drawbacks. Firstly, the IoT-specific attacks, such as forced power consumption, forced topology change, etc., have a split sequence of operations, time intervals in the chain of actions, and some attacks may also have a non-linear sequence of actions. Secondly, there is a class of polymorphic attacks that have mutations (namely local differences, omissions, delays) in the sequences of actions during the implementation of the attack. These polymorphic attacks adapt to a wide range of conditions, operating systems, and circumstances and try to avoid scanning by defenses to infect endpoints . They are called the polymorphic attacks because they have different signatures that make them difficult to identify using a single signature. For example, in order to disguise ongoing network attacks, attackers can use polymorphic shellcode to create unique attack patterns. This method usually involves encoding the payload, with the decoder placed at the front of the payload before sending it. All known the pattern matching methods do not detect such variations of a single sample.

Our research proposes new, the nature inspired, technologies, namely a genetic sequence alignment computing and a sequence similarity calculation, instead of massive equity checking of multiple packets and signatures traditionally used for the IDS composing. The aim of our research is to assess the prospects of using the biological coded sequence alignment algorithms for solving the task of detection the specific network intrusions. The novelty of our results is the fact that it has been proposed to encode network packets (or the chain of activity acts) into the sequences of the nucleotide codes for their subsequent alignment and matching with patterns. The matching operation is based on the similarity checking, not equity checking, as it was usually applied for the intrusion detection. Therefore, our contribution to the cyber security is the proposed bio-inspired approach that, in contrast to traditional methods, makes it possible to detect new variations of the attacks in the IoT networks, to reduce the signature database, and to increase the detection effectiveness on the low-resource devices.

The following material is organized in the next manner: Sect.ਂ reviews successful cases of applying modern nature-inspired methods in solving cyber security issues Sect.ਃ discusses methods for aligning sequences: global, local, and additive approaches Sect.਄ presents the proposed coding and alignment technique Sect.ਅ presents our results of the experiments on detecting the DDoS attacks using a BoT-IoT dataset on the portative IoT Raspberry Pi4 platform Sect.ਆ presents a discussion of the results of applying the bio-inspired methods Sect.ਇ concludes our contribution.

Background

Biological sequence alignment is a cornerstone of bioinformatics and is widely used in such fields as phylogenetic reconstruction, gene finding, genome assembly. The accuracy of the sequence alignments and similarity measures are directly related to the accuracy of subsequent analysis. CDS alignment methods have many important applications for gene tree and protein tree reconstruction. In fact, they are useful to cluster homologous CDS into groups of orthologous splicing isoforms [1, 2] and combine partial trees on orthology groups into a complete protein tree for a gene family [3, 4]. Aligning and measuring the similarity between homologous CDS requires to account for frameshift (FS) translations that cannot be detected at the amino acid (AA) level, but lead to a high similarity at the nucleotide level between functionnaly different sub-sequences.

FS translation consists in alternative AA translations of a coding region of DNA using different translation frames . It is an important phenomenon resulting from different scenarios such as, insertion or deletion of a nucleotide sequence whose length is not a multiple of 3 in a CDS through alternative splicing [6, 7] or evolutionary genomic indels [8, 9], programmed ribosomal frameshifting , or sequencing errors . Recent studies have reported the role of FS translations in the appearance of novel CDS and functions in gene evolution [6, 12]. FS translation has also been found to be linked to several diseases such as the Crohn’s disease . The computational detection of FS translations requires the alignment of CDS while accounting for their codon structure. A classical approach for aligning two CDS used in most alignment tools [14, 15] consists in a three-step method, where the CDS are first translated into AA sequences using their actual coding frame, then AA sequences are aligned, and finally the AA alignment is back-translated to a CDS alignment. This approach does not account for alternative AA translations between two CDS and it leads to incorrect alignment of the coding regions subject to FS translation. The opposite problem of aligning protein sequences while recovering their hypothetical nucleotide CDS sequences and accounting for FS translation was also studied in several papers [16, 17].

Here, we consider the problem of aligning two CDS while accounting for FS translation, by simultaneously accounting for their nucleotide and AA sequences. The problem has recently regained attention due to the increasing evidence for alternative protein production through FS translation by eukaryotic gene families [18, 19].

The problem was first addressed by Hein et al. [20, 21] who proposed a DNA/protein model such that the score of an alignment between two CDS of length n and m is a combination of its score at the nucleotide level and its score at the AA level. They described a O(n 2 m 2 ) algorithm in , later improved to a O(nm) algorithm in  for computing an optimal score alignment, under the constraint that the search space of the problem is restricted. Arvestad  later proposed a CDS alignment scoring model with a O(nm) alignment algorithm accounting for codon structures and FS translations based on the concept of generalized substitutions introduced in . In this model, the score of a CDS alignment depends on its decomposition into a concatenation of codon fragment alignments, such that a codon fragment of a CDS is defined as a substring of length w, (0le w le 5) . This decomposition into codon fragment alignments allows to define a score of the CDS alignment at the AA level. More recently, Ranwez et al.  proposed a simplification of the model of Arvestad limiting the maximum length of a codon fragment to 3. Under this model, a O(nm) CDS alignment algorithm was described and extended in the context of multiple sequence alignment . In the models of Arvestad  and Ranwez et al. , several scores may be computed for the same alignment based on different decompositions into codon fragment alignments. The corresponding algorithms for aligning two CDS then consist in computing an optimal score decomposition of an alignment between the two CDS. This optimal score exclusively accounts for FS translation initiations, i.e a FS translation in an alignment is penalized by adding a constant FS cost, which only penalizes the initiation of the FS, not accounting for the length of this FS translation. However, taking account of FS translation lengths is important in order to increase the precision of CDS alignment scores, as these lengths induce more or less disruptions between the protein sequences.

In this paper, we propose the first alignment algorithm that accounts for both the initiation and the length of FS translations in order to compute the similarity scores of CDS alignments. The remaining of the paper is organized as follows. In the “Motivation”, we illustrate the importance of accounting for FS translation length when aligning CDS. In the “Preliminaries”, we give some preliminary definitions and we introduce a new CDS alignment scoring model with a self-contained definition of the score of an alignment penalizing both the initiation and the extension of FS translations. In the “Methods”, a dynamic programming algorithm for computing an optimal score alignment between two CDS is described. Finally, in the “Results”, we present and discuss the results of a comparison of our method with other CDS alignment methods for a pairwise comparison of CDS from homologous genes of ten mammalian gene families.

Motivation: importance of accounting for FS translation length

The two main goals of aligning biological sequences are to evaluate the similarity and to identify similar regions between the sequences, used thereafter to realize molecular analyses such as evolutionary, functional and structural predictions. In practice, CDS alignment can be used to exhaustively identify the conserved features of a set of proteins. Thus, the definition of CDS similarity must account for sequence conservation and disruptions at both the nucleotide and the protein levels.

Figure 1 illustrates the importance of accounting for AA translations and FS translation length in order to compute an adequate similarity score for a CDS alignment. It describes an example of three CDS Seq1 , Seq2 and Seq3 . Seq1 has a length of 45. The CDS Seq2 has length 60 and is obtained from Seq1 by deleting the nucleotide ‘G’ at position 30 and adding 16 nucleotides at the end. The CDS Seq3 has length 60 and is obtained from Seq1 by deleting the nucleotide ‘G’ at position 15 and adding 16 nucleotides at the end.

Top an example of three CDS Seq1 , Seq2 and Seq3 . Middle an optimal alignment between Seq1 and Seq2 with a FS translation region of length 15. Bottom an optimal alignment between Seq1 and Seq3 with a FS translation region of length 30

When looking at the AA translations of Seq1 , Seq2 and Seq3 , we observe that the similarity between Seq2 and Seq1 is higher than the similarity between Seq3 and Seq1 at the protein level, because Seq1 and Seq2 share a longer AA prefix “M T E S K Q P W H” (amino acids in black characters in the alignments). However, the pairwise CDS alignment algorithms that do not account for the length of FS translations would return the same score for the two following optimal alignments of Seq1 with Seq2 and Seq1 with Seq3 , penalizing only the initiation of one FS translation in both cases (positions marked with a “!” symbol in the alignments), and not penalizing the sequence disruptions at the protein level.

From an evolutionary point of view, a good scoring model for evaluating the similarity between two CDS in the presence of FS translations should then penalize not only the initiation of FS but also the length of FS translations extension (amino acids in gray characters in the alignments). The alignment of Seq1 with Seq2 would then have a higher similarity score than the alignment of Seq1 with Seq3 .

Preliminaries: score of CDS alignment

In this section, we formally describe a new definition of the score of a CDS alignment that penalizes both the initiation and the extension of FS translations.

Definition 1

[Coding DNA sequence (CDS)] A coding DNA sequence (CDS) is a DNA sequence on the alphabet of nucleotides (Sigma _N=) whose length n is a multiple of 3. A coding sequence is composed of a concatenation of (frac<3>) codons that are the words of length 3 in the sequence ending at positions 3i, (1 le i le frac<3>) . The AA translation of a CDS is a protein sequence of length (frac<3>) on the alphabet (Sigma _A) of AA such that each codon of the CDS is translated into an AA symbol in the protein sequence.

Note that, in practice an entire CDS begins with a start codon “ATG” and ends with a stop codon “TAA” , “TAG” or “TGA” .

Definition 2

(Alignment between DNA sequences) An alignment between two DNA sequences A and B is a pair ((A',B')) where (A') and (B') are two sequences of same length L derived by inserting gap symbols ('-') in A and B, such that (forall i,

1 le i le L) , in the alignment is called a column of the alignment.

Given an alignment ((A',B')) of length L between two CDS A and B, let S be the sequence (A') or (B') . We denote by (S[k

1 le k le l le L) , the substring of S going from position k to position l. (left| ight|) denotes the number of letters in () that are different from the gap symbol ('-') . For example, if (A'= exttt) and (B'= exttt) , (|A'[4

8]| = | exttt| = 3) . A codon of A or B is grouped in the alignment ((A',B')) if its three nucleotides appear in three consecutive columns of the alignment. For example, the first codon ACC of A is grouped, while the first codon ACT of B is not grouped.

An alignment ((A',B')) of length 48 between two CDS, A (13 codons) and B (14 codons). The number arrays indicate the positions of the consecutive alignment columns. Codons of A and B are colored according to the set to which they belong: IM codons in blue color, FSext codons in red color, InDel codons in green color and FSinit codons in black color. MFS nucleotides contained in FSinit codons are underlined

In the following, we give our definition of the score of an alignment ((A',B')) between two CDS A and B. It is based on a partition of the codons of A (resp. B) into four sets depending on the alignment of codons (see Fig. 2 for an illustration):

The set of In-frame Matching codons (IM) contains the codons that are grouped in the alignment and aligned with a codon of the other CDS.

The set of Frameshift extension codons (FSext) contains the codons that are grouped in the alignment and aligned with a concatenation of three nucleotides that overlaps two codons of the other CDS.

The set of Deleted/Inserted codons (InDel) contains the codons that are grouped in the alignment and aligned with a concatenation of 3 gap symbols.

All other codons constitutes Frameshift initiation codons (FSinit) . The set of Matching nucleotides in FSinit codons (MFS) contains all the nucleotides belonging to FSinit codons and aligned with a nucleotide of the other CDS.

The following notations and conventions are used in Definition 3 to denote the different sets of codons and nucleotides in A and B. The set of IM codons in A (resp. B) is denoted by ( exttt_) (resp. ( exttt_) ). The set of FSext codons in A (resp. B) is denoted by ( exttt_) (resp. ( exttt_) ). The set of InDel codons in A (resp. B) is denoted by ( exttt_) (resp. ( exttt_) ). The set of MFS nucleotides in A (resp. B) is denoted by ( exttt_) (resp. ( exttt_) ). In these sets, the codons of A and B are simply identified by the position (column) of their last nucleotide in the alignment. In this case, we always have ( exttt_ = exttt_) as in the example below. The MFS nucleotides are also identified by their positions in the alignment.

In the alignment scoring model described in Definition 3, the substitutions of IM and FSext codons are scored using an AA scoring function (s_) such that aligned codons with silent nucleotide mutations get the same score as identity. A fixed FS extension cost denoted by fs_extend_cost is added for each FSext codon. The insertions/deletions of InDel codons are scored by adding a fixed gap cost denoted by gap_cost for each InDel codon. The alignment of MFS nucleotides are scored independently from each other, using a nucleotide scoring function (s_) . The insertions or deletions of nucleotides in FSinit codons are responsible for the initiation of FS translations. They are then scored by adding a fixed FS opening cost denoted by fs_open_cost for each FSinit codon. Note that, by convention, the values of all penalty costs for gap and FS ( gap_cost , fs_open_cost , fs_extend_cost ) are negative. Note also that the scoring scheme assumes that the AA and the nucleotide scoring functions, (s_) and (s_) , are symmetric.

Definition 3

(Score of an alignment) Let ((A',B')) be an alignment of length L between two CDS A and B. The score of the alignment ((A',B')) is defined by:

Title: Pocket arithmetic and Needleman-Wunsch on Cray Research computers

In the interest of providing more efficient computer-based analysis of DNA and protein sequences, Cray Research has developed a high performance implementation of the sequence alignment method of Needleman and Wunsch using the programming technique of pocket arithmetic. The basis for this implementation is the program SEQHDP, which finds locally homologous subsequences of a protein sequence pair and determines the statistical significance of the homology. Pocket arithmetic takes advantage of the 64-bit width of an operand on the Cray Y-MP by packing more than one integer value per word, then performing logical or integer operations on the packed word to yield multiple results per operation. This technique, in combination with the vector processing capabilities of the Cray Y-MP CPU, produces substantially improved performance over the conventionally coded version of the same algorithm. The authors will introduce the programming technique of pocket arithmetic, then describe its implementation in the Needleman-Wunsch sequence comparison function in SEQHDP. Performance results based on actual protein sequence comparisons are presented.