Alignments

Presented here are notes on sequence alignment. First I will go over pairwise sequence alignment, both local and global. Next we will go over multiple sequence alignment.

Pairwise sequence alignment

The goal of pairwise sequence alignment is to compare two sequences and align them to see if they are similar or to ask which parts are similar. We have two sequences:

sequence 1: ACCGTTG
sequence 2: CCGTGA

and we expect that after aligning we might get something like

sequence 1: ACCGTTG-
sequence 2: -CCGT-GA

or

sequence 1: ACCGTTG-
sequence 2: -CCG-TGA

Either way, we don’t expect that something like

sequence 1: ACCGTTG------
sequence 2: -------CCGTGA

Why not? What allows us to distinguish between these different alignments? We have some way to assign scores to each and to choose the one with the best score. Here we will start with global pairwise sequence alignment where we want to align the complete sequences from two sequences.

Global pairwise sequence alignment

The global pairwise sequence alignment that we will talk about will be the Needleman-Wunsch-Gotoh algorithm. This is an extension of the original Needleman-Wunsch (1970) with a speedup by Gotoh (1982). I urge you to follow through to the end before deciding that you don’t understand. It takes a little bit of walking through to understand what is going on.

In order to calculate the alignment with the best score, we need some sort of scoring scheme that says what the score of a match or missmatch is.

Let’s start with two sequence strings that we want to align

ACCGTTG
CCGTGA

We want to align these two sequence strings and to do so we are going to calculate a matrix that gives the score for each position. This way we can find the one with the highest score and that will be our best global alignment.

F(i,j)=max\begin{cases}F(i-1,j-1)+s(x_{i},y_{j}),\\F(i-1,j)-d,\\F(i,j-1)-d,\end{cases}

The idea would be to fill this number in for a matrix that presents our two sequences. That matrix is first going to be filled with zeros like this

\begin{matrix} & & A & C & C & G & T & T & G \\ & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 \\C & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 \\C & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 \\G & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 \\T & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 \\G & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 \\A & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 \\\end{matrix}

Note the extra zero at 0,0. This is because we are going to fill in this matrix by starting at index-1 on the row and index-1 in the column. Then to fill in the next cell, we ask what the numbers of the cell to the diagonal, above and to the left are and calculate the current cell. The other thing that we need in addition to a scoring matrix (you can find those here ftp://ftp.ncbi.nih.gov/blast/matrices/) is the cost for adding a gap. Let’s say that is -5 for now. You can see we are just adding the gap penalty (-5 along the edges).

We can go ahead and fill out what the gaps look like at the edges of the matrix.

\begin{matrix} & & A & C & C & G & T & T & G \\ & 0.0 & -5.0 & -10.0 & -15.0 & -20.0 & -25.0 & -30.0 & -35.0 \\C & -5.0 & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 \\C & -10.0 & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 \\G & -15.0 & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 \\T & -20.0 & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 \\G & -25.0 & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 \\A & -30.0 & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 & 0.0 \\\end{matrix}

Now lets look at what the whole scoring for this matrix looks like

\begin{matrix} & & A & C & C & G & T & T & G \\ & 0.0 & -5.0 & -10.0 & -15.0 & -20.0 & -25.0 & -30.0 & -35.0 \\C & -5.0 & -4.0 & 0.0 & -5.0 & -10.0 & -15.0 & -20.0 & -25.0 \\C & -10.0 & -9.0 & 1.0 & 5.0 & 0.0 & -5.0 & -10.0 & -15.0 \\G & -15.0 & -14.0 & -4.0 & 0.0 & 10.0 & 5.0 & 0.0 & -5.0 \\T & -20.0 & -19.0 & -9.0 & -5.0 & 5.0 & 15.0 & 10.0 & 5.0 \\G & -25.0 & -24.0 & -14.0 & -10.0 & 0.0 & 10.0 & 11.0 & 15.0 \\A & -30.0 & -20.0 & -19.0 & -15.0 & -5.0 & 5.0 & 6.0 & 10.0 \\\end{matrix}

The result of walking back (traceback as it is often referred) is this alignment:

unalign: ACCGTTG
unalign: CCGTGA
aligned: ACCGTTG-
aligned: -CCG-TGA

The code below implements the algorithm discussed above and can be found in the file pairwise_alignment.py. There is also code here for reading the scoring matrices in NCBI format.


def nwg_pairwise(seq1,seq2,gap_penalty,scoringmatrix):
    seq1 = seq1.upper()
    seq2 = seq2.upper()
    F = zeros((len(seq2)+1,len(seq1)+1))
    d = gap_penalty
    for i in range(len(seq1)+1):
        F[0][i] = d*i
    for j in range(len(seq2)+1):
        F[j][0] = d*j
    for i in range(1,len(seq1)+1):
        for j in range(1,len(seq2)+1):
            match = F[j-1][i-1] + scoringmatrix[seq1[i-1]][seq2[j-1]]
            delete = F[j-1][i] + d
            insert = F[j][i-1] + d
            F[j][i] = max(match,delete,insert)
    aln1 = ""
    aln2 = ""
    i = len(seq1)
    j = len(seq2)
    print "unalign: "+seq1
    print "unalign: "+seq2
    while( i > 0 and j > 0):
        score = F[j][i]
        scorediag = F[j-1][i-1]
        scoreup = F[j][i-1]
        scoreleft = F[j-1][i]
        if (score == scorediag + scoringmatrix[seq1[i-1]][seq2[j-1]]):
            aln1 += seq1[i-1]
            aln2 += seq2[j-1]
            i = i - 1
            j = j - 1
        elif (score == scoreup+d):
            aln1 += seq1[i-1]
            aln2 += "-"
            i = i - 1
        elif (score == scoreleft + d):
            aln1 += "-"
            aln2 += seq2[j-1]
            j = j - 1
    while (i > 0):
        aln1 += seq1[i-1]
        aln2 += "-" 
        i = i - 1
    while(j > 0):
        aln1 += "-"
        aln2 += seq2[j-1]
        j = j -1
    print "aligned: "+aln1[::-1]
    print "aligned: "+aln2[::-1]
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
def nwg_pairwise(seq1,seq2,gap_penalty,scoringmatrix):
    seq1 = seq1.upper()
    seq2 = seq2.upper()
    F = zeros((len(seq2)+1,len(seq1)+1))
    d = gap_penalty
    for i in range(len(seq1)+1):
        F[0][i] = d*i
    for j in range(len(seq2)+1):
        F[j][0] = d*j
    for i in range(1,len(seq1)+1):
        for j in range(1,len(seq2)+1):
            match = F[j-1][i-1] + scoringmatrix[seq1[i-1]][seq2[j-1]]
            delete = F[j-1][i] + d
            insert = F[j][i-1] + d
            F[j][i] = max(match,delete,insert)
    aln1 = ""
    aln2 = ""
    i = len(seq1)
    j = len(seq2)
    print "unalign: "+seq1
    print "unalign: "+seq2
    while( i > 0 and j > 0):
        score = F[j][i]
        scorediag = F[j-1][i-1]
        scoreup = F[j][i-1]
        scoreleft = F[j-1][i]
        if (score == scorediag + scoringmatrix[seq1[i-1]][seq2[j-1]]):
            aln1 += seq1[i-1]
            aln2 += seq2[j-1]
            i = i - 1
            j = j - 1
        elif (score == scoreup+d):
            aln1 += seq1[i-1]
            aln2 += "-"
            i = i - 1
        elif (score == scoreleft + d):
            aln1 += "-"
            aln2 += seq2[j-1]
            j = j - 1
    while (i > 0):
        aln1 += seq1[i-1]
        aln2 += "-" 
        i = i - 1
    while(j > 0):
        aln1 += "-"
        aln2 += seq2[j-1]
        j = j -1
    print "aligned: "+aln1[::-1]
    print "aligned: "+aln2[::-1]

This algorithm runtime scales in term so of the sequences lengths of the two sequences. In O notation, we would say that it is quadratic and runs at O(MN) where M and N are the two sequence lengths. Here is a graph of what that looks like on my computer with the expected and observed run times.

nw_time

Next up we will talk about local pairwise alignment with the Smith-Waterman method.

Pages: 1 2 3

Leave a Reply