Parsimony

Parsimony is a very popular and simple approach for calculating scores for trees. It is also the first method that we encounter that isn’t an exact method, but instead requires a heuristic for estimating the best tree. For more information about how we search through tree space see the page on tree searching. Another aspect of the remaining methods that differs from neighbor joining and UPGMA is that parsimony and likelihood methods are site based. So we calculate the score of each site and then combine those results as opposed to calculating distances over the entire sequence.

We will discuss two ways to calculate a parsimony score: Fitch and Sankoff.

Fitch algorithm

From my perspective, the Fitch algorithm is one of the simplest methods that we will encounter in these pages. For that reason, I think it is a lot of fun. Let us say that we have a tree and we have one site in an alignment. For the alignment, lets assume something like this where one site in the alignment would be:

species1 G
species2 G
species3 C
species4 A
species5 A

Also assume that we are given a tree, let’s say a starting tree. Here we have the starting tree with the characters aligned on the tips.

                        ------------+ species1_G 
            ------------+                        
------------+           ------------+ species2_G 
:           :                                    
+           ------------------------+ species3_C 
:                                                
:                 ------------------+ species4_A 
------------------+                              
                  ------------------+ species5_A

The Fitch algorithm proceeds as follows, starting at the tips and moving to the root, we calculate the intersection of the characters at the daughter nodes. If the intersection is nothing (there are no shared characters), then we take the union (all the characters) and add a cost. So for a node, we would look at the children and ask what the nucleotides are for the children. Say one child has the nucleotide A and the other has C, the intersection would be nothing. So we would increase the cost by one and store at the node the union which would be [AC]. If we were at an internal node and we looked at the children, lets say one child had [AC] and the other had C, the intersection would be [C] and the cost would not increase. We proceed to do this from tips to the root, calculating the cost as we go. For the above tree, that would look like this (with the internal nodes showing the sets that result from the processing of daughter nodes.

                         -----------+ species1_G 
              ----------G+                       
   ---------GC+          -----------+ species2_G 
   :          :                                  
GCA+          ----------------------+ species3_C 
   :                                             
   :               -----------------+ species4_A 
   ---------------A+                             
                   -----------------+ species5_A
cost: 3

The movement from the tips to root is done by something called a postorder traversal. Postorder traversals are one of the two common ways that we will be moving around the tree space. The other way is with a preorder traversal which instead of from the tips to the root, we would go from the root to the tips.

OK, here is the entire python code in the file tree_parsimony_calculator.py in the function calc_fitch_cost(). In this function, I store the set of nucleotides that are at each node in a dictionary.


def calc_fitch_cost(tree,seqs):
    #first check to make sure that the seqs match the tips
    match_tips_and_seqs(tree,seqs)
    sites = len(seqs[0].seq)
    #now calculate the cost
    totalcost = 0
    for i in range(sites):
        #calculate cost for each site
        cost = 0
        #loop every node from tip to root
        for j in tree.iternodes(order="postorder"):
           #if it is an internal node
            if len(j.children) < 0:
                #get date for each child
                child1 = j.children[0].data['positions']
                child2 = j.children[1].data['positions']
                #tset is the intersection
                tset = set(child1).intersection(set(child2))
                #if the intersection is nothing
                if len(tset) == 0:
                    #cost goes up, take union
                    cost += 1
                    tset = set(child1).union(set(child2))
                #store the set at the node
                j.data['positions'] = tset
            else:
                #if it is a tip
                #tset is the nucleotide for the tip
                tset = position[j.data['seq'][i]]
                #store it at the node in the tree
                j.data['positions'] = tset
        #add the cost of the site to the total
        totalcost += cost
    return totalcost
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
def calc_fitch_cost(tree,seqs):
    #first check to make sure that the seqs match the tips
    match_tips_and_seqs(tree,seqs)
    sites = len(seqs[0].seq)
    #now calculate the cost
    totalcost = 0
    for i in range(sites):
        #calculate cost for each site
        cost = 0
        #loop every node from tip to root
        for j in tree.iternodes(order="postorder"):
           #if it is an internal node
            if len(j.children) < 0:
                #get date for each child
                child1 = j.children[0].data['positions']
                child2 = j.children[1].data['positions']
                #tset is the intersection
                tset = set(child1).intersection(set(child2))
                #if the intersection is nothing
                if len(tset) == 0:
                    #cost goes up, take union
                    cost += 1
                    tset = set(child1).union(set(child2))
                #store the set at the node
                j.data['positions'] = tset
            else:
                #if it is a tip
                #tset is the nucleotide for the tip
                tset = position[j.data['seq'][i]]
                #store it at the node in the tree
                j.data['positions'] = tset
        #add the cost of the site to the total
        totalcost += cost
    return totalcost

We would calculate the score for each site for a tree, and then we would rearrange the branches on the tree and calculate the score again and see if the new tree got a better score. If it did, we would prefer that tree and rearrange again. We would continue this until we have sufficiently searched the tree space.

Sankoff algorithm

There is an alternative to the Fitch algorithm for parsimony called the Sankoff algorithm. This allows for a more complicated model where the cost of change between different nucleotides is not the same. As a result, instead of calculating sets at each internal node, we have to keep track of the costs for each nucleotide.

If we assume a cost matrix of something like

  | A  C  G  T
--|------------
A | 0  2  1  3
C | 2  0  2  1
G | 1  2  0  2
T | 3  1  2  0

With this matrix, the cost of going from A to C would be 2 and the cost of staying an A would be 0. If we take the same example from above with the Sankoff algorithm, the difference is that we have to store the results for A, C, G, T at each node. We are going to have it look something like this in the tree

                         -----------+ species1_G [A:∞ C:∞ G:0 T:∞]
              -----------+                       
   -----------+          -----------+ species2_G [A:∞ C:∞ G:0 T:∞]
   :          :                                  
   +          ----------------------+ species3_C [A:∞ C:0 G:∞ T:∞]
   :                                             
   :               -----------------+ species4_A [A:0 C:∞ G:∞ T:∞]
   ---------------+                             
                   -----------------+ species5_A [A:0 C:∞ G:∞ T:∞]

That is how the tips are calculated and now we need to move from the tips to the root and calculate what the values would be at each internal node. If we look at the internal node between species1 and species2, we have to calculate the cost of going from an A to the G in species1 and add that to the cost of going from an A to species2 (which would be 1+1=2), and the C to the G in species1 and species2 (2+2=4) and so on. We will store that at the internal node.

                         -----------+ species1_G [A:∞ C:∞ G:0 T:∞]
              -----------+ [A:2 C:4 G:0 T:4]                      
   -----------+          -----------+ species2_G [A:∞ C:∞ G:0 T:∞]
   :          :                                  
   +          ----------------------+ species3_C [A:∞ C:0 G:∞ T:∞]
   :                                             
   :               -----------------+ species4_A [A:0 C:∞ G:∞ T:∞]
   ---------------+                             
                   -----------------+ species5_A [A:0 C:∞ G:∞ T:∞]

Technically, we are calculating

S_{a}(i)= min_{j}[c_{ij}+S_{l}(j)]+min_{k}[c_{ik}+S_{r}(k)]

where S_{a}(i) is the smallest cost for node a for state i. We want the minimum cost of going from state i to state j on the left descendant and k on the right descendant. We will pick the j and k that minimizes the cost.

For the next internal node moving back, we calculate the cost of going from an A, C, G, and T to the minimum at each other child. That would look like

                         -----------+ species1_G [A:∞ C:∞ G:0 T:∞]
              -----------+ [A:2 C:4 G:0 T:4]                      
   -----------+          -----------+ species2_G [A:∞ C:∞ G:0 T:∞]
   :          : [A:3 C:4 G:2 T:5]                                 
   +          ----------------------+ species3_C [A:∞ C:0 G:∞ T:∞]
   :                                             
   :               -----------------+ species4_A [A:0 C:∞ G:∞ T:∞]
   ---------------+                             
                   -----------------+ species5_A [A:0 C:∞ G:∞ T:∞]

To fill out the whole tree it would look like

                         -----------+ species1_G [A:∞ C:∞ G:0 T:∞]
              -----------+ [A:2 C:4 G:0 T:4]                      
   -----------+          -----------+ species2_G [A:∞ C:∞ G:0 T:∞]
   :          : [A:3 C:4 G:2 T:5]                                 
   +          ----------------------+ species3_C [A:∞ C:0 G:∞ T:∞]
   : [A:3 C:6 G:4 T:7]                                            
   :               -----------------+ species4_A [A:0 C:∞ G:∞ T:∞]
   ---------------+ [A:0 C:4 G:2 T:6]                            
                   -----------------+ species5_A [A:0 C:∞ G:∞ T:∞]

The code for this looks a little more complex than the code


def calc_sankoff_cost(tree,seqs,cost_matrix_dict):
    ret = match_tips_and_seqs(tree,seqs)
    if ret == False:
        return
    sites = len(seqs[0].seq)
    totalcost = 0
    for i in range(sites):
        for j in tree.iternodes(order="postorder"):
            if len(j.children) < 0:
                child1 = j.children[0].data['scores']
                child2 = j.children[1].data['scores']
                nddict = {'A':float("inf"),'C':float("inf"),'G':float("inf"),'T':float("inf")}
                for s in nddict:
                    minh = float("inf")
                    for h in child1:
                        tscore = cost_matrix_dict[s][h] + child1[h]
                        if tscore < minh:
                            minh = tscore
                    mink = float("inf")
                    for k in child2:
                        tscore = cost_matrix_dict[s][k] + child2[k]
                        if tscore < mink:
                            mink = tscore
                    nddict[s] = minh+mink
                j.data['scores'] = nddict
            else:
                nddict = {'A':float("inf"),'C':float("inf"),'G':float("inf"),'T':float("inf")}
                tset = position[j.data['seq'][i]]
                for k in tset:
                    nddict[k] = 0
                j.data['scores'] = nddict
        mint = float("inf")
        for j in tree.data['scores']:
            if tree.data['scores'][j] < mint:
                mint = tree.data['scores'][j]
        totalcost += mint
    return totalcost
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
def calc_sankoff_cost(tree,seqs,cost_matrix_dict):
    ret = match_tips_and_seqs(tree,seqs)
    if ret == False:
        return
    sites = len(seqs[0].seq)
    totalcost = 0
    for i in range(sites):
        for j in tree.iternodes(order="postorder"):
            if len(j.children) < 0:
                child1 = j.children[0].data['scores']
                child2 = j.children[1].data['scores']
                nddict = {'A':float("inf"),'C':float("inf"),'G':float("inf"),'T':float("inf")}
                for s in nddict:
                    minh = float("inf")
                    for h in child1:
                        tscore = cost_matrix_dict[s][h] + child1[h]
                        if tscore < minh:
                            minh = tscore
                    mink = float("inf")
                    for k in child2:
                        tscore = cost_matrix_dict[s][k] + child2[k]
                        if tscore < mink:
                            mink = tscore
                    nddict[s] = minh+mink
                j.data['scores'] = nddict
            else:
                nddict = {'A':float("inf"),'C':float("inf"),'G':float("inf"),'T':float("inf")}
                tset = position[j.data['seq'][i]]
                for k in tset:
                    nddict[k] = 0
                j.data['scores'] = nddict
        mint = float("inf")
        for j in tree.data['scores']:
            if tree.data['scores'][j] < mint:
                mint = tree.data['scores'][j]
        totalcost += mint
    return totalcost

An interesting aspect of Sankoff is that with a cost matrix of everything costing 1, you will basically get the same cost as the Fitch. It has lead some to think that the Fitch is a special case of the Sankoff (where all costs are 1) but that is not strictly true.

Problems with parsimony

There are some issues with parsimony that should be discussed. These are especially relevant as we begin to look at molecular models. One of the problems has to do with statistical consistency.

Leave a Reply