Molecular Models

The model that is used for parsimony is that the simplest explanation is the correct one. Also, parsimony assumes that there is no convergence. In other words, the model does not consider

A---->C------>A


This consideration can be taken into account with the use of models. Here we will discuss some of the issues related to models and we will introduce the concept of calculating likelihood.

Calculating distance with multiple mutations

When you calculate the distance between two sequences, you can just count the number of changes. You can also calculate the proportion of differences. So for the two sequences

ACTCAACCTTCCAGC
ACCCAGACTTCTAGC


There are 4 changes and therefore 4/15 = 0.26666 proportional changes. This could be thought of as a branch length between sequence 1 and sequence 2. However, because of the possibility for multiple mutations at a site, this might be an underestimate.

In order to account for this, we need some sort of model. The simplest model that we could use is called the Jukes-Cantor.

Jukes-Cantor
The Jukes-Cantor model considers that all nucleotides all have the same frequency and that all the transitions are equally probable. Given this model, if we want to calculate the ‘corrected’ distance that accommodates for the multiple substitutions at each site, we can calculate

$d_{xy}=-(\frac{3}{4})ln(1-\frac{4}{3}D)$

where x and y are the two sequences and D is the proportional difference as calculated above.

How do we get to this? Well, let’s look at this model in a different way. If we have equal probability to change between each nucleotide, we can describe the substitution rate matrix, or the probability of a nucleotide changing to another nucleotide over an infinitesimally small amount of time. This matrix for the Jukes-Cantor model

$Q=\left[\begin{array}{cccc}-3\lambda & \lambda & \lambda & \lambda\\\lambda & -3\lambda & \lambda & \lambda\\\lambda & \lambda & -3\lambda & \lambda\\\lambda & \lambda & \lambda & -3\lambda\end{array}\right]$

where the rows correspond to A,C,G,T and the columns correspond to A,C,G,T. The lambda would correspond to the rate. There is only