Maximum Likelihood


Maximum Likelihood is a method for the inference of phylogeny. It evaluates a hypothesis about evolutionary history in terms of the probability that the proposed model and the hypothesized history would give rise to the observed data set. The supposition is that a history with a higher probability of reaching the observed state is preferred to a history with a lower probability. The method searches for the tree with the highest probability or likelihood.


Programs

The Maximum Likelihood method of inference is available for both nucleic acid and protein data. The following programs are available from the web:


Advantages and disadvantages of maximum likelihood methods:

 



Explication of the method

Maximum likelihood evaluates the probability that the chosen evolutionary model would have generated the observed sequences. Phylogenies are then inferred by finding those trees that yield the highest likelihood.

Assume that we have the aligned nucleotide sequences for four taxa:

           1         j       ....N

 (1)       A G G C U C C A A ....A       

 (2)       A G G U U C G A A ....A      

 (3)       A G C C C A G A A ....A

 (4)       A U U U C G G A A ....C


and we want to evaluate the likelihood of the unrooted tree shown below for the nucleotides at site j of the sequences:

 

         (1)          (2) 
           \          /               
            \        /
              ------
            /        \
           /          \ 
         (3)          (4) 

What is the probability that this tree would have generated the data presented in the sequences under the chosen model?

Since most of the models currently used are time-reversible, the likelihood of the tree is generally independent of the position of the root. Therefore it is convenient to root the tree at an arbitrary internal node, as done in the figure below:

    C    C  A      G
     \  /   |     /
      \/    |    /
       A    |   /
        \   |  /
         \  | /
            A

Under the assumption that nucleotide sites evolve independently of one another, with substitutions along each branch following a Markovian model of evolution, we can calculate the likelihood for each site separately and combine the site likelihoods into a total value at the end. To calculate the likelihood for site j, we have to consider all the possible scenarios by which the nucleotides present at the tips of the tree could have evolved. The likelihood for a particular site is therefore the sum of the probabilities of every possible reconstruction of ancestral states, given some model of base substitution. In this specific case each of the nucleotides A, G, C, and U (T in DNA) may occupy the internal nodes (5) and (6), giving 4 x 4 = 16 possibilities:

                 _                _
                |  C    C A      G |           
                |   \  /  |     /  |            
                |    \/   |    /   |
L(j) = Sum(Prob |    (5)  |   /    |)
                |      \  |  /     |
                |       \ | /      |
                |_       (6)      _|
               

In the case of protein sequences each site may occupy 20 states (the 20 amino acids), and thus 20 x 20 = 400 possibilities have to be considered. Since any one of these scenarios could have led to the nucleotide configuration at the tips of the tree, we must calculate the probability of each and sum them to obtain the total probability for each site j.
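
The sum over the 16 ancestral reconstructions can be sketched in Python as follows. This is only an illustration under stated assumptions: the text does not fix a substitution model or branch lengths, so the sketch assumes the Jukes-Cantor model (all substitutions equally likely, equal base frequencies) and a branch length of 0.1 expected substitutions per site on every branch. The names jc_prob and site_likelihood are made up for this example.

```python
import math
from itertools import product

BASES = "ACGU"

def jc_prob(a, b, t):
    """Jukes-Cantor probability that base a is replaced by base b along a
    branch of length t (expected substitutions per site; assumed model)."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay

def site_likelihood(tips, t=0.1):
    """Likelihood of one site for the tree rooted at node (6).

    Node (6) is joined to node (5), tip (3) and tip (4);
    node (5) is joined to tips (1) and (2).
    tips holds the bases observed at tips (1), (2), (3), (4).
    """
    s1, s2, s3, s4 = tips
    total = 0.0
    # Sum over all 4 x 4 = 16 assignments of bases to nodes (6) and (5).
    for x6, x5 in product(BASES, repeat=2):
        p = 0.25                                      # frequency of the root base
        p *= jc_prob(x6, x5, t)                       # branch (6) -> (5)
        p *= jc_prob(x5, s1, t) * jc_prob(x5, s2, t)  # branches to tips (1), (2)
        p *= jc_prob(x6, s3, t) * jc_prob(x6, s4, t)  # branches to tips (3), (4)
        total += p
    return total

# Site j of the alignment: C, C, A, G at tips (1) to (4).
L_j = site_likelihood("CCAG")
print("L(j) =", L_j)
```

For protein data the same loop would run over 20 x 20 = 400 assignments, with jc_prob replaced by the transition probabilities of an amino-acid model.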

The likelihood for the full tree is then the product of the likelihoods at each site.

 

                                N
L = L(1) x L(2) ..... x L(N) = PROD L(j)
                               j=1

Since the individual likelihoods are extremely small numbers, it is convenient to sum the log likelihoods over all sites and to report the likelihood of the entire tree as the log likelihood.

 

                                            N 
ln L= ln L(1) + ln L(2) ..... + ln L(N) = SUM ln L(j)
                                          j=1
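
The practical reason for working with logarithms can be seen in a short Python sketch. The per-site likelihoods used here are hypothetical values chosen only for illustration (1000 sites, each with likelihood 0.002); they do not come from the alignment above.

```python
import math

def log_likelihood(site_likelihoods):
    """ln L = sum of ln L(j) over all sites."""
    return sum(math.log(L) for L in site_likelihoods)

# Hypothetical per-site likelihoods (illustrative values only).
site_L = [0.002] * 1000

# The direct product is far below the smallest positive double-precision
# number, so it underflows to zero.
direct_product = math.prod(site_L)

# The sum of logarithms stays well within floating-point range.
lnL = log_likelihood(site_L)

print("product of L(j):", direct_product)
print("ln L           :", lnL)
```

Because the logarithm is a monotonic function, the tree with the highest ln L is also the tree with the highest L, so nothing is lost by comparing trees on the log scale.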


The model of evolution

The model of evolution, which attributes a certain probability to each possible nucleotide or amino acid substitution, is essential to obtain the correct tree. In the case of protein sequences the simplest model is the Poisson model, which assumes that all changes between amino acids occur at the same rate. This assumption is clearly unreasonable for protein sequence data. Therefore, the PROTML program in the MOLPHY package (Adachi and Hasegawa, 1992), as well as the PUZZLE program by Strimmer and von Haeseler (1995), have implemented an instantaneous rate matrix derived from the Dayhoff empirical substitution matrix. This has been called the Dayhoff model. More recently, a model called the JTT model of evolution, based upon the updated empirical substitution matrix of Jones et al. (1992), has been developed and implemented in these programs.
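
For the Poisson model the substitution probabilities have a simple closed form, sketched below in Python. The sketch assumes that branch length t is measured in expected substitutions per site and that all 20 amino acids are equally frequent; the function name poisson_prob is made up for this example. The Dayhoff and JTT models need their empirical rate matrices and are not reproduced here.

```python
import math

N_AMINO_ACIDS = 20

def poisson_prob(same, t, k=N_AMINO_ACIDS):
    """Equal-rates (Poisson) model with k states.

    Returns the probability that a site is in the same state after branch
    length t (same=True), or in one particular different state (same=False).
    t is in expected substitutions per site (assumed scaling).
    """
    decay = math.exp(-k * t / (k - 1))
    if same:
        return 1.0 / k + (k - 1) / k * decay
    return 1.0 / k - decay / k

t = 0.5
p_same = poisson_prob(True, t)
p_diff = poisson_prob(False, t)

# One unchanged outcome plus 19 possible replacements must sum to 1.
row_sum = p_same + (N_AMINO_ACIDS - 1) * p_diff
print(p_same, p_diff, row_sum)
```

With k=4 the same function gives the Jukes-Cantor probabilities for nucleotides, which shows that the Poisson model is the protein counterpart of the simplest nucleotide model.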

The maximum likelihood tree
The above procedure is then repeated for all possible topologies (or for all possible trees).
The tree with the highest likelihood is the maximum likelihood tree.
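
As a rough illustration of this search, the Python sketch below counts the unrooted bifurcating topologies for a given number of taxa and scores the three possible topologies for the four taxa of the example. It rests on several assumptions that are not part of the text: a Jukes-Cantor model, a fixed branch length of 0.1 on every branch (real programs also optimize branch lengths for each topology), and only the nine alignment columns printed above (the elided columns are not guessed). All function names are made up for this example.

```python
import math
from itertools import product

BASES = "ACGU"

def n_unrooted_trees(n_taxa):
    """Number of unrooted bifurcating topologies: 3 x 5 x ... x (2n - 5)."""
    count = 1
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count

def jc_prob(a, b, t):
    """Jukes-Cantor substitution probability over branch length t (assumed model)."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay

def site_likelihood(tips, pairing, t):
    """Sum over the 16 states of the two internal nodes for one site."""
    (a, b), (c, d) = pairing  # tip indices joined at node (5) and at node (6)
    total = 0.0
    for x5, x6 in product(BASES, repeat=2):
        total += (0.25 * jc_prob(x6, x5, t)
                  * jc_prob(x5, tips[a], t) * jc_prob(x5, tips[b], t)
                  * jc_prob(x6, tips[c], t) * jc_prob(x6, tips[d], t))
    return total

def log_likelihood(seqs, pairing, t=0.1):
    """ln L of one topology: sum of ln L(j) over the alignment columns."""
    return sum(math.log(site_likelihood(col, pairing, t)) for col in zip(*seqs))

# The nine columns shown in the alignment for taxa (1) to (4).
seqs = ["AGGCUCCAA", "AGGUUCGAA", "AGCCCAGAA", "AUUUCGGAA"]

# The three unrooted topologies for four taxa (tip indices start at 0).
topologies = {
    "((1,2),(3,4))": ((0, 1), (2, 3)),
    "((1,3),(2,4))": ((0, 2), (1, 3)),
    "((1,4),(2,3))": ((0, 3), (1, 2)),
}

scores = {name: log_likelihood(seqs, pairing) for name, pairing in topologies.items()}
best = max(scores, key=scores.get)

print("topologies for 4 and 10 taxa:", n_unrooted_trees(4), n_unrooted_trees(10))
for name, lnL in scores.items():
    print(name, round(lnL, 3))
print("maximum likelihood topology:", best)
```

The count grows very quickly with the number of taxa, which is why an exhaustive evaluation of all trees is only feasible for small data sets.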



Last updated: 8 August 1997.
Created by: Fred Opperdoes