Presented here are notes on sequence alignment. First I will go over pairwise sequence alignment, both local and global. Next we will go over multiple sequence alignment.
Pairwise sequence alignment
The goal of pairwise sequence alignment is to compare two sequences and align them to see if they are similar or to ask which parts are similar. We have two sequences:
sequence 1: ACCGTTG sequence 2: CCGTGA
and we expect that after aligning we might get something like
sequence 1: ACCGTTG sequence 2: CCGTGA
or
sequence 1: ACCGTTG sequence 2: CCGTGA
Either way, we don’t expect that something like
sequence 1: ACCGTTG sequence 2: CCGTGA
Why not? What allows us to distinguish between these different alignments? We have some way to assign scores to each and to choose the one with the best score. Here we will start with global pairwise sequence alignment where we want to align the complete sequences from two sequences.
Global pairwise sequence alignment
The global pairwise sequence alignment that we will talk about will be the NeedlemanWunschGotoh algorithm. This is an extension of the original NeedlemanWunsch (1970) with a speedup by Gotoh (1982). I urge you to follow through to the end before deciding that you don’t understand. It takes a little bit of walking through to understand what is going on.
In order to calculate the alignment with the best score, we need some sort of scoring scheme that says what the score of a match or missmatch is.
Let’s start with two sequence strings that we want to align
ACCGTTG CCGTGA
We want to align these two sequence strings and to do so we are going to calculate a matrix that gives the score for each position. This way we can find the one with the highest score and that will be our best global alignment.
The idea would be to fill this number in for a matrix that presents our two sequences. That matrix is first going to be filled with zeros like this
Note the extra zero at 0,0. This is because we are going to fill in this matrix by starting at index1 on the row and index1 in the column. Then to fill in the next cell, we ask what the numbers of the cell to the diagonal, above and to the left are and calculate the current cell. The other thing that we need in addition to a scoring matrix (you can find those here ftp://ftp.ncbi.nih.gov/blast/matrices/) is the cost for adding a gap. Let’s say that is 5
for now. You can see we are just adding the gap penalty (5
along the edges).
We can go ahead and fill out what the gaps look like at the edges of the matrix.
Now lets look at what the whole scoring for this matrix looks like
The result of walking back (traceback as it is often referred) is this alignment:
unalign: ACCGTTG unalign: CCGTGA aligned: ACCGTTG aligned: CCGTGA
The code below implements the algorithm discussed above and can be found in the file pairwise_alignment.py
. There is also code here for reading the scoring matrices in NCBI format.
def nwg_pairwise(seq1,seq2,gap_penalty,scoringmatrix):
seq1 = seq1.upper()
seq2 = seq2.upper()
F = zeros((len(seq2)+1,len(seq1)+1))
d = gap_penalty
for i in range(len(seq1)+1):
F[0][i] = d*i
for j in range(len(seq2)+1):
F[j][0] = d*j
for i in range(1,len(seq1)+1):
for j in range(1,len(seq2)+1):
match = F[j1][i1] + scoringmatrix[seq1[i1]][seq2[j1]]
delete = F[j1][i] + d
insert = F[j][i1] + d
F[j][i] = max(match,delete,insert)
aln1 = ""
aln2 = ""
i = len(seq1)
j = len(seq2)
print "unalign: "+seq1
print "unalign: "+seq2
while( i > 0 and j > 0):
score = F[j][i]
scorediag = F[j1][i1]
scoreup = F[j][i1]
scoreleft = F[j1][i]
if (score == scorediag + scoringmatrix[seq1[i1]][seq2[j1]]):
aln1 += seq1[i1]
aln2 += seq2[j1]
i = i  1
j = j  1
elif (score == scoreup+d):
aln1 += seq1[i1]
aln2 += ""
i = i  1
elif (score == scoreleft + d):
aln1 += ""
aln2 += seq2[j1]
j = j  1
while (i > 0):
aln1 += seq1[i1]
aln2 += ""
i = i  1
while(j > 0):
aln1 += ""
aln2 += seq2[j1]
j = j 1
print "aligned: "+aln1[::1]
print "aligned: "+aln2[::1]
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 
def nwg_pairwise(seq1,seq2,gap_penalty,scoringmatrix):
seq1 = seq1.upper()
seq2 = seq2.upper()
F = zeros((len(seq2)+1,len(seq1)+1))
d = gap_penalty
for i in range(len(seq1)+1):
F[0][i] = d*i
for j in range(len(seq2)+1):
F[j][0] = d*j
for i in range(1,len(seq1)+1):
for j in range(1,len(seq2)+1):
match = F[j1][i1] + scoringmatrix[seq1[i1]][seq2[j1]]
delete = F[j1][i] + d
insert = F[j][i1] + d
F[j][i] = max(match,delete,insert)
aln1 = ""
aln2 = ""
i = len(seq1)
j = len(seq2)
print "unalign: "+seq1
print "unalign: "+seq2
while( i > 0 and j > 0):
score = F[j][i]
scorediag = F[j1][i1]
scoreup = F[j][i1]
scoreleft = F[j1][i]
if (score == scorediag + scoringmatrix[seq1[i1]][seq2[j1]]):
aln1 += seq1[i1]
aln2 += seq2[j1]
i = i  1
j = j  1
elif (score == scoreup+d):
aln1 += seq1[i1]
aln2 += ""
i = i  1
elif (score == scoreleft + d):
aln1 += ""
aln2 += seq2[j1]
j = j  1
while (i > 0):
aln1 += seq1[i1]
aln2 += ""
i = i  1
while(j > 0):
aln1 += ""
aln2 += seq2[j1]
j = j 1
print "aligned: "+aln1[::1]
print "aligned: "+aln2[::1]

This algorithm runtime scales in term so of the sequences lengths of the two sequences. In O notation, we would say that it is quadratic and runs at O(MN) where M and N are the two sequence lengths. Here is a graph of what that looks like on my computer with the expected and observed run times.
Next up we will talk about local pairwise alignment with the SmithWaterman method.