Methods for the inference of protein phylogeny
By Fred R. Opperdoes
Introduction
Once a multiple sequence alignment has been prepared, it can serve as
the starting point for further evolutionary analyses. The final goal
of such an analysis is an evolutionary tree that describes the
relationships among the various taxa. To understand the terminology
used in phylogeny, study the hypothetical tree shown below:

Figure: Example of a phylogenetic tree.
There are two main classes of methods for the construction of
evolutionary trees: "Distance Methods", which are based on a matrix of
pairwise distance values between all sequences in the alignment, and
"Character-Based Methods", which carry out calculations on each of the
individual residues of the sequences. In general, distance methods
are fast, while character-based methods are much slower because they
are CPU (central processing unit) intensive.
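To make the idea of a pairwise distance matrix concrete, here is a
minimal sketch in Python. It computes uncorrected p-distances (the
fraction of differing residues) and is only an illustration: the
function names and the toy alignment are invented for this example,
and real programs normally apply a correction based on an evolutionary
model rather than using raw p-distances.

```python
from itertools import combinations


def p_distance(seq_a, seq_b, gap="-"):
    """Fraction of differing residues, ignoring columns that contain a gap."""
    compared = differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differing / compared


def distance_matrix(alignment):
    """Symmetric dict-of-dicts of p-distances for a {name: sequence} alignment."""
    names = list(alignment)
    matrix = {n: {m: 0.0 for m in names} for n in names}
    for n, m in combinations(names, 2):
        matrix[n][m] = matrix[m][n] = p_distance(alignment[n], alignment[m])
    return matrix


# Toy aligned sequences, invented for illustration only.
toy_alignment = {
    "seqA": "MKTAYIAKQR",
    "seqB": "MKTAYIGKQR",
    "seqC": "MRTSYI-KQK",
}
toy_matrix = distance_matrix(toy_alignment)
```

A distance method works only from `toy_matrix`, whereas a
character-based method would go back to the individual columns of
`toy_alignment`.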
Methods available for tree construction
- Distance methods (in the public domain)
  - UPGMA
  - Transformed Distance Methods
  - Least Squares Methods: Fitch and Kitsch of the Phylip package
    (Joe Felsenstein, Univ. Washington) are least squares methods
    (Fitch and Margoliash)
  - Neighbor-joining methods
    - Neighbor of the Phylip package (Joe Felsenstein, Univ.
      Washington)
    - ClustalW (D. Higgins, EMBL)
    - Distnj in the Protml package (Adachi and Hasegawa, Univ.
      Tokyo)
  - Darwin (Gaston Gonnet, ETH, Zurich, via mail server or WWW)
- Character-based methods
  - Protein maximum likelihood (in the public domain)
    - Protml (Adachi and Hasegawa, Univ. Tokyo) (very CPU
      intensive)
    - Puzzle (Korbinian Strimmer and Arndt von Haeseler, Univ.
      Munich) (a heuristic method, much faster than Protml)
  - Protein maximum parsimony (in the public domain)
    - Protpars (Joe Felsenstein, Univ. Washington)
    - Paup (David Swofford; the latest version will be
      commercial)
Distance Matrix Methods
NB: These methods, all of which assume an evolutionary model,
yield only one best tree.
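As an illustration of how a distance matrix method arrives at a single
tree, the following is a minimal Python sketch of UPGMA, the simplest
of the distance methods. It assumes a constant evolutionary rate
(a molecular clock); the function name, the data layout and the toy
distances are assumptions made for this example and do not reproduce
the code of any of the packages named on this page.

```python
def upgma(names, dist):
    """Cluster OTUs by UPGMA.

    dist maps a pair of OTU names, e.g. ("A", "B"), to their distance.
    Returns (tree, root_height), where tree is a nested tuple.
    """
    size = {n: 1 for n in names}  # number of OTUs in each active cluster
    height = {n: 0.0 for n in names}  # node height of each cluster
    d = {frozenset(pair): value for pair, value in dist.items()}
    while len(size) > 1:
        # Join the two closest clusters.
        pair = min(d, key=d.get)
        a, b = sorted(pair, key=repr)
        new = (a, b)
        # Distance from the new cluster to every other cluster is the
        # average over all OTU pairs (weighted by cluster size).
        for c in size:
            if c == a or c == b:
                continue
            d[frozenset((new, c))] = (
                size[a] * d[frozenset((a, c))] + size[b] * d[frozenset((b, c))]
            ) / (size[a] + size[b])
        height[new] = d[pair] / 2
        # Drop all distances that involve the two merged clusters.
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        size[new] = size[a] + size[b]
        del size[a], size[b]
    (root,) = size
    return root, height[root]


# Toy distances, invented for illustration only.
toy_distances = {
    ("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
    ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10,
}
toy_tree, toy_root_height = upgma(["A", "B", "C", "D"], toy_distances)
```

For a given matrix the procedure is deterministic: it returns one tree
and gives no indication of how much worse alternative topologies
would be.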
Character-based methods
How to root an unrooted tree?
- Most methods for the inference of phylogeny yield trees that
are unrooted. Thus from the tree by itself it is impossible to tell
which of the operational taxonomic units (OTUs) branched off before
all the others.
- To root a tree one should add an outgroup to the data set. An
outgroup is an OTU for which external information (e.g.
paleontological information) indicates that it branched off before
all other taxa.
Some advice on rooting a tree
- Do not choose an outgroup that is very distantly related to
your taxa. This may result in serious topological errors, because
sites may have become saturated with multiple mutations that have
erased the phylogenetic information.
- Do not choose an outgroup that is too closely related to the
taxa in question either. In that case it may not be a true
outgroup.
- The use of more than one outgroup generally improves the
estimate of tree topology.
- In the absence of a good outgroup the root may be positioned
by assuming approximately equal evolutionary rates over all the
branches. In this way the root is put at the midpoint of the
longest pathway between two OTUs. This way of rooting is called
mid-point rooting.
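Mid-point rooting as described above can be sketched in a few lines of
Python. This is an illustration under stated assumptions, not the
implementation used by any particular program: the unrooted tree is
given as a list of (node, node, branch length) edges, the helper names
are invented, and the internal node labels "x" and "y" in the toy tree
are arbitrary.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Turn (u, v, length) edges into a symmetric adjacency mapping."""
    adj = defaultdict(dict)
    for u, v, length in edges:
        adj[u][v] = length
        adj[v][u] = length
    return adj


def farthest_from(adj, start):
    """Return (node, distance, path) for the node farthest from start."""
    best = (start, 0.0, [start])
    stack = [(start, None, 0.0, [start])]
    while stack:
        node, parent, dist, path = stack.pop()
        if dist > best[1]:
            best = (node, dist, path)
        for nxt, length in adj[node].items():
            if nxt != parent:
                stack.append((nxt, node, dist + length, path + [nxt]))
    return best


def midpoint_root(edges):
    """Locate the midpoint of the longest path between two OTUs.

    Returns (u, v, offset): the root lies on branch u-v, offset from u.
    """
    adj = build_adjacency(edges)
    any_node = next(iter(adj))
    # Two sweeps find the two ends of the longest path in a tree.
    end_a, _, _ = farthest_from(adj, any_node)
    _, longest, path = farthest_from(adj, end_a)
    half = longest / 2
    walked = 0.0
    for u, v in zip(path, path[1:]):
        length = adj[u][v]
        if walked + length >= half:
            return u, v, half - walked
        walked += length
    raise ValueError("tree has no branches")


# Toy unrooted tree with four OTUs (A-D), invented for illustration.
toy_edges = [
    ("A", "x", 1.0), ("B", "x", 2.0), ("x", "y", 3.0),
    ("C", "y", 1.0), ("D", "y", 6.0),
]
root_position = midpoint_root(toy_edges)
```

In the toy tree the longest pathway runs between B and D, so the root
is placed on the long branch leading to D. If D has in fact evolved
faster than the other OTUs, the assumption of equal rates is violated
and the root will be misplaced.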
NB: Tree topologies may strongly depend on the
following:
- Whether DNA or protein sequences are used in the analysis
- Whether distance or parsimony methods are applied
- The number of OTUs included in the alignment
- The order of the OTUs in the alignment
- The selection of an appropriate outgroup
NB: None of the methods can guarantee that the resulting tree has
the correct topology.
How to use the tree-construction programs?
To get an idea of the reliability of the topology of the resulting
tree, one should do one or more of the following:
- Apply more than one method (distance, maximum parsimony,
maximum likelihood) to the data set.
- Vary the parameters used by the different programs, such as
the seed value and jumble factor for the order of OTU
addition.
- When in doubt, apply various evolutionary models for matrix
construction.
- Add or remove one or more OTUs and see how this influences
tree topology.
- Try to include an outgroup that may serve as a root for your
tree.
- Apply "Bootstrap" or "Jackknife"
analyses to your data set and prepare a consensus tree of 100 -
1000 replicates (depending on the size of the data set and on
computer power). Keep in mind that in a bootstrap analysis only
nodes that occur in more than 95% of the replicates are considered
reliable.
NB: Only when widely different methods give similar or identical
tree topologies, and these topologies are supported by good
bootstrap values (> 95%), can the trees be considered reliable.
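The bootstrap analysis mentioned above resamples the columns of the
alignment with replacement, builds a tree from each replicate, and
counts how often each grouping is recovered. The Python sketch below
shows only the resampling and counting steps. All names are invented
for this example, and `closest_pair_clade` is a deliberately crude
stand-in for a real tree-building program: it merely reports the two
most similar sequences as a group.

```python
import random
from itertools import combinations


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement (same length as input)."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {
        name: "".join(seq[i] for i in columns)
        for name, seq in alignment.items()
    }


def bootstrap_support(alignment, infer_clades, n_replicates=100, seed=1):
    """Fraction of replicates in which each clade is recovered.

    infer_clades is any function that takes an alignment and returns a
    set of frozensets of OTU names (the groupings found in its tree).
    """
    rng = random.Random(seed)
    counts = {}
    for _ in range(n_replicates):
        replicate = bootstrap_replicate(alignment, rng)
        for clade in infer_clades(replicate):
            counts[clade] = counts.get(clade, 0) + 1
    return {clade: n / n_replicates for clade, n in counts.items()}


def closest_pair_clade(alignment):
    """Stand-in for tree inference: group the two most similar sequences."""
    def mismatches(pair):
        a, b = pair
        return sum(x != y for x, y in zip(alignment[a], alignment[b]))

    best = min(combinations(sorted(alignment), 2), key=mismatches)
    return {frozenset(best)}


# Toy alignment, invented for illustration only.
toy_alignment = {
    "otu1": "MKTAYIAK",
    "otu2": "MKTAYIGK",
    "otu3": "MRSAYLGR",
    "otu4": "MRSGYLGR",
}
support = bootstrap_support(toy_alignment, closest_pair_clade, n_replicates=50)
```

In practice the stand-in would be replaced by one of the tree-building
programs, and the groupings from all replicates would be summarized in
a consensus tree.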
Limitations of the various methods
- Distance approaches (UPGMA, corrected distances and
neighbor-joining) do not use the original sequence data, but
distance information derived from it, so some information is said
to be lost.
- Character-state approaches (maximum parsimony, maximum
likelihood) are said to be more powerful than distance methods
because they use the raw data. However, maximum parsimony uses only
the informative sites, which are usually a small fraction of the
data. So when the number of informative sites is not large, this
method is often less efficient than distance methods (Saitou and
Nei, 1986). Maximum parsimony is also notorious for its sensitivity
to codon bias and unequal rates of evolution.
- None of the methods is reliable when OTUs with highly unequal
evolutionary separation are included in the data set.
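To see how small the fraction of sites used by maximum parsimony can
be, one can count the informative sites in an alignment. The Python
sketch below assumes the usual definition of a parsimony-informative
site: at least two different residues must each occur in at least two
sequences. The function names and the toy sequences are invented for
this example.

```python
from collections import Counter


def is_informative(column, gap="-"):
    """True if at least two residues each occur in two or more sequences."""
    counts = Counter(residue for residue in column if residue != gap)
    return sum(1 for n in counts.values() if n >= 2) >= 2


def informative_sites(sequences):
    """Indices of the parsimony-informative columns of an alignment."""
    return [
        i for i, column in enumerate(zip(*sequences))
        if is_informative(column)
    ]


# Toy aligned sequences, invented for illustration only.
toy_sequences = [
    "AAGTC",
    "AAGCC",
    "ATACC",
    "ATATG",
]
sites = informative_sites(toy_sequences)
fraction = len(sites) / len(toy_sequences[0])
```

In the toy alignment the first column is invariant and the last column
has a change in a single sequence only, so neither can favour one
topology over another under parsimony.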
Complication of paralogous genes
The presence of more than one homologue of a certain gene, or of
different members of a gene family, in one and the same organism may
considerably complicate phylogenetic analyses. Such genes are called
paralogous genes.
Two genes are said to be orthologous if they diverged after a
speciation event, whereas they are said to be paralogous if they
diverged after a duplication event.
Let's take the example of the mammalian lactate dehydrogenase
isoenzymes M and L. In the case of mouse and rat the isoenzymes are
the result of a gene duplication that took place well before the
separation of these two species.

where () indicates speciation and X indicates gene
duplication.
Here one says that the LDH_M gene family is paralogous to the
LDH_L gene family. If one is not aware of the presence of
paralogous genes and isoenzymes in the organisms, for instance
because only one sequence is available for each organism (e.g.
LDH_L for mouse and LDH_M for rat) and isoenzyme data are missing,
then the resulting phylogeny would suggest a much earlier separation
of mouse and rat than actually occurred. Mixing paralogous sequences
in this way inevitably leads to erroneous phylogenetic trees.
Definition: Paralogy is homology that arises from gene duplication.
A phylogenetic tree constructed from a mixture of genes generated by
duplications may therefore not reflect the phylogeny of the species.
Still to add:
some literature examples
Origin of chloroplasts in J. Molec. Evolution, 1995
Last updated: 25 September 1997.
Created by: Fred Opperdoes