Arguments in favour of a phylogenetic
analysis of the corresponding protein rather than the DNA
Codon bias
- The genetic code is degenerate: most amino acids are encoded by
several codons that differ mainly in the third (wobble) position.
Yeasts, protozoa, and animals have different codon preferences,
which result in differences in DNA sequence that reflect codon
bias rather than the evolutionary history of the protein. Compare
the codon usage tables for human and yeast. (Click here to access
the Japanese CUTG database with codon usage information for many
different species.)
- Also, some protozoa (the ciliates) use the codons TAA and TAG to
encode glutamine rather than STOP, and in many mitochondria the
codon TGA encodes tryptophan rather than STOP. The inclusion of
such unique codons in a subset of the sequences will tend to make
that subset appear more divergent than it really is. (For more
information about organelles visit the Organelle Database.)
- Therefore, it may be advantageous to first translate a coding
sequence or open reading frame of a gene into its corresponding
protein sequence. The result will be a peptide sequence in either
the one- or three-letter code.
- Working with proteins rather than with DNA means that you have
to know the three- and especially the one-letter code for amino
acids. It is important to familiarise yourself with the
one-letter code as quickly as possible. You may find the codes
here.
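The translation step, and the effect of a deviating genetic code, can be
sketched in a few lines of Python. This is a minimal illustration and not
a reference implementation: the names `translate`, `overrides` and
`MITO_TGA_TRP` are assumptions made for this example, the table is the
universal code written out in TCAG order, and the mitochondrial variant
changes only the TGA codon mentioned above.

```python
from itertools import product

# Universal genetic code, codons enumerated in TCAG order (assumed layout).
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
STANDARD_CODE = {"".join(c): aa
                 for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Hypothetical override table: TGA read as tryptophan, as in many mitochondria.
MITO_TGA_TRP = {"TGA": "W"}


def translate(dna, overrides=None):
    """Translate an open reading frame into one-letter amino-acid code.

    `overrides` maps codons to amino acids for organisms or organelles
    whose code departs from the universal one. Trailing bases that do not
    fill a codon are ignored; ambiguous bases raise a KeyError.
    """
    code = dict(STANDARD_CODE)
    code.update(overrides or {})
    dna = dna.upper().replace("U", "T")
    usable = len(dna) - len(dna) % 3
    return "".join(code[dna[i:i + 3]] for i in range(0, usable, 3))


orf = "ATGTGGTGATAA"
print(translate(orf))                # MW**
print(translate(orf, MITO_TGA_TRP))  # MWW*
```

The same DNA gives two different peptides depending on the code table,
which is exactly why a subset of sequences with a deviating code looks
more divergent at the DNA level than it is at the protein level.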
Back to the Table of
Contents
Long Time Horizon
- Homologous sequences that diverge with time tend to
incorporate mutations more or less at random. This makes the two
sequences more different from each other as time passes. The
chance that a position in the DNA incorporates a second mutation,
which obscures the first mutational event, or even a back
mutation, which leaves no observable difference, increases with
the total number of mutations that have been incorporated and thus
also with time. In protein-coding sequences the first and second
positions of each codon are less prone to the incorporation of
mutations, because a change there almost always leads to a change
of the amino acid in the corresponding position of the protein.
The third position, also called the wobble position, may in most
cases be mutated without directly affecting the protein. Check
this by analysing the universal genetic code.
- When comparing protein-coding sequences that have diverged for
possibly a billion years or more, it is very likely that the
wobble bases in the codons will have become randomized. By
excluding the wobble bases, that is by removing every third
nucleotide from the protein-coding sequence (a general technique
in phylogenetic analyses), one is in effect looking at amino-acid
sequences. Removing the wobble bases from your DNA sequences can
be cumbersome, while it is much easier to simply translate the
open reading frame into its corresponding protein sequence.
- There are several more reasons why it may be advantageous to
translate an open reading frame into protein before carrying out
phylogenetic analyses (see below).
Conclusion: due to wobble-base degeneracy the third bases in the
codons of a protein-coding gene are of little value in the
analysis of distantly related proteins!
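The claim about the three codon positions can be checked directly against
the universal code. The sketch below is only an illustration: the
function names `synonymous_fraction` and `drop_wobble` are assumptions of
this example, stop codons are left out of the count, and all single-base
changes are treated as equally likely, which real mutation spectra are
not.

```python
from itertools import product

# Universal genetic code, codons enumerated in TCAG order (assumed layout).
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def synonymous_fraction(position):
    """Fraction of single-base changes at codon position 0, 1 or 2 that
    leave the encoded amino acid unchanged (sense codons only)."""
    same = total = 0
    for codon, aa in CODE.items():
        if aa == "*":
            continue
        for base in BASES:
            if base == codon[position]:
                continue
            mutant = codon[:position] + base + codon[position + 1:]
            total += 1
            same += CODE[mutant] == aa
    return same / total


def drop_wobble(orf):
    """Keep only the first two bases of every complete codon."""
    usable = len(orf) - len(orf) % 3
    return "".join(orf[i:i + 2] for i in range(0, usable, 3))


for pos in range(3):
    print(pos + 1, round(synonymous_fraction(pos), 3))
print(drop_wobble("ATGGCCTGA"))  # ATGCTG
```

Under these assumptions roughly two thirds of the changes at the third
position are silent, against only a few percent at the first position
and none at the second.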
Advantages of the translation of DNA
into protein
- DNA is composed of only four kinds of unit: A, G, C and T.
- If gaps are not allowed, on average 25% of residues in two
randomly chosen aligned sequences would be identical.
- If gaps are allowed, as much as 50% of residues in two
randomly chosen aligned sequences can be identical. Such a
situation may obscure any genuine relationship that may exist,
especially when comparing distantly related or rapidly evolving
gene sequences.
- Moreover, it is easier to translate a gene sequence into its
corresponding protein sequence than to remove the third wobble
base from each of the codons in the gene.
- Translating the DNA reduces the 64 codons to 21 symbols (20
amino acids and a terminator), which sharpens the information
considerably. Wrong-frame information is set aside.
- Third-base degeneracies are consolidated.
- After insertion of gaps to align two random protein sequences
it can be expected that they are between 10 and 20% identical.
- As a result of the translation procedure the protein sequences
with their 20 amino acids are much easier to align than the
corresponding DNA sequences with only 4 nucleotides.
- If, after this, you still want to align distantly related gene
sequences, you had better first prepare a protein alignment and
then use it as the basis for the alignment of the corresponding
gene sequences and for the precise placement of indels in the
aligned sequences.
Conclusion: The signal-to-noise ratio is greatly improved when
using protein sequences rather than DNA sequences!
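The 25% background figure for DNA follows from simple probability: two
random residues match with a chance equal to the sum of the squared
residue frequencies. The sketch below assumes equal frequencies and no
gaps, and the function name `expected_identity` is an assumption of this
example. Real amino-acid frequencies are unequal and gaps raise the
background further, which is why the text quotes 10 to 20% for gapped
random protein alignments rather than the 5% computed here.

```python
def expected_identity(frequencies):
    """Chance that two residues drawn independently from the same
    frequency distribution are identical (ungapped comparison)."""
    return sum(p * p for p in frequencies)


dna = expected_identity([1 / 4] * 4)        # four equally frequent bases
protein = expected_identity([1 / 20] * 20)  # twenty equally frequent amino acids

print(f"DNA:     {dna:.2%}")
print(f"protein: {protein:.2%}")
```

The five-fold lower background for proteins is the improvement in signal
to noise that the conclusion above refers to.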
Nature of Sequence Divergence in
Proteins (the PAM)
- The observed sequence difference between two sequences that
diverge with time follows a negative exponential course. This is
because each position is subject not only to the incorporation of
mutations, but also to reverse changes ("back mutations") and
multiple hits. Such events increase in number as the evolutionary
distance between two homologous proteins increases.
- This leads to an underestimation of the evolutionary distance
between two homologous proteins; as a consequence, the observed
percentage of difference between two protein sequences is not
proportional to the actual evolutionary distance.
- A measure that is proportional to the true evolutionary
distance between proteins is the PAM value. PAM (Dayhoff and Eck,
1968) is the number of Accepted Point Mutations per 100 amino
acids.
Examples of some PAM values and their corresponding observed
distances are given in the following Table.
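The way the observed difference lags behind the true distance can be
sketched with two standard approximations. Neither is Dayhoff's
matrix-based PAM calculation: the first is the simple Poisson correction,
the second is Kimura's empirical formula, which is often used as a
stand-in for PAM distances. The function names are assumptions of this
example, and the values it prints are not taken from the Table referred
to above.

```python
import math


def poisson_distance(p):
    """Substitutions per 100 residues, from the observed fraction p of
    differing positions, assuming equal rates and no back mutations."""
    return -100.0 * math.log(1.0 - p)


def kimura_distance(p):
    """Kimura's empirical correction, d = -ln(1 - p - p^2/5), scaled to
    substitutions per 100 residues. Only defined while the argument of
    the logarithm stays positive (p below roughly 0.85)."""
    return -100.0 * math.log(1.0 - p - 0.2 * p * p)


for observed in (0.1, 0.3, 0.5, 0.7, 0.8):
    print(f"{observed:.0%} observed difference -> "
          f"Poisson {poisson_distance(observed):6.1f}, "
          f"Kimura {kimura_distance(observed):6.1f}")
```

Both corrections stay close to the observed percentage for similar
sequences and rise steeply as the observed difference approaches
saturation, which is the behaviour described in the bullets above.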
Proteins Evolve at Highly Different
Rates.
- Pseudogenes and proteins with functions that are less
essential to the organism evolve rapidly. As a result evolutionary
information is quickly erased.
- House-keeping proteins, such as histones, enzymes of essential
metabolic pathways and proteins of the cytoskeleton, evolve slowly
and incorporate between 1 and 10 mutations per 100 residues per
100 million years (inspect the Table on protein-evolution rates).
Therefore, it takes a considerable time before they have
incorporated enough substitutions (250-350 per 100 residues) to
erase all evolutionary information. Because of this slow rate of
evolution house-keeping proteins are excellent tools to trace
evolutionary relationships over long periods of time. For instance,
the slow mutation rate of the glycolytic enzyme
glyceraldehyde-3-phosphate dehydrogenase provides us with a
look-back window of some 9 billion years, twice the age of our
solar system (see the Table on important dates).
Conclusion: proteins are excellent tools to study the evolutionary
relationships of both closely and distantly related taxa!
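The look-back window is simple arithmetic: the number of substitutions
needed to erase the signal divided by the rate at which they accumulate.
The sketch below uses only the figures quoted above (1 to 10
substitutions per 100 residues per 100 million years, and 250 to 350
substitutions per 100 residues for saturation). The function name and
the assumption of a constant rate are ours, and the actual rate for
glyceraldehyde-3-phosphate dehydrogenase is in the Table referred to
above, which is not reproduced here.

```python
def look_back_window_years(rate_per_100_my, saturation=250):
    """Years needed to accumulate `saturation` substitutions per 100
    residues at a constant rate of `rate_per_100_my` substitutions per
    100 residues per 100 million years."""
    return saturation / rate_per_100_my * 100e6


for rate in (1, 3, 10):
    low = look_back_window_years(rate, 250)
    high = look_back_window_years(rate, 350)
    print(f"rate {rate:>2}: {low / 1e9:.1f} to {high / 1e9:.1f} billion years")
```

Under these assumptions a window of about 9 billion years corresponds to
a rate of roughly 3 to 4 substitutions per 100 residues per 100 million
years.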
- Many proteins are encoded on each piece of DNA and so, when
confronted with a DNA sequence, a biologist needs to figure out
where the code for a protein starts and stops. This problem is
even more difficult because a eukaryotic genome contains much more
DNA than is needed to encode proteins; the sequence of a random
piece of DNA is likely to encode no protein whatsoever.
- The DNA which encodes proteins is often not continuous, but
rather is frequently scattered in separate blocks called exons.
Many of these problems can be reduced by sequencing RNA (via
cDNA) rather than the DNA itself, because the cDNA contains much
less extraneous material, and because the separate exons have been
joined in one continuous stretch in the RNA (cDNA). There are
situations, however, where analysis of RNA is not possible and the
DNA itself needs to be analyzed.
- Eukaryotic genes in general are fragmented into exons and
intervening introns. Due to differences in evolutionary pressure
on exons and introns the rate of incorporation of base
substitutions in these two elements of eukaryotic genes may be
dramatically different. Therefore, a study of the evolution of a
protein using its DNA sequence should only include coding
sequences. This requires that all the introns be edited out of
every DNA sequence, which may be cumbersome and time consuming.
Therefore, it may be easier to translate a cDNA into its
corresponding protein, rather than working with the genomic DNA
sequences.
- Although a much greater fraction of RNA encodes protein than
does DNA, it is certainly not the case that all RNA encodes
protein. First, there can be RNA up- and down-stream of the
coding region. These non-coding regions can be quite large, in
some cases dwarfing the coding region. Further, not all RNAs
encode proteins. Ribosomal RNA (rRNA), transfer RNA (tRNA), and
the structural RNA of small nuclear ribonucleoproteins (snRNA) are
all examples of non-coding RNA.
- By and large, global, complete solutions are not available for
determining an encoded protein sequence from a DNA sequence.
However, by combining a variety of computational approaches with
some laboratory biology, scientists have been fairly successful at
accomplishing this in many specific cases. Nonetheless, this
problem is currently considered one of the most important in
computational biology.
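For an intron-free sequence such as a cDNA, the simplest of these
computational approaches is a scan for open reading frames. The sketch
below is deliberately naive and rests on several assumptions: it looks
only at the three forward frames, takes ATG as the only start codon and
the three universal stop codons as terminators, and knows nothing about
introns, codon usage or sequencing errors. The name `find_orfs` and the
`min_codons` threshold are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) for every ATG...stop stretch on the
    forward strand; `end` is exclusive and includes the stop codon."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first ATG opens the frame
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # look for the next ORF
    return orfs


sequence = "CCATGAAATAGCC"
for start, end, frame in find_orfs(sequence):
    print(frame, sequence[start:end])  # 2 ATGAAATAG
```

On genomic eukaryotic DNA such a scan will both miss genes that are
split by introns and report stretches that encode no protein at all,
which is why the text calls the general problem unsolved.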
Multigene Families
- Organisms may contain many highly similar genes, while only
one peptide sequence can be identified (e.g. histones and GAPDH in
humans). Using these DNA sequences, it would be difficult to
decide which genes are expressed and which are not and thus to
decide which genes to include in the analysis. Moreover, if all
the genes that are expressed encode the same protein, then DNA
differences are not significant.
Protein is the Unit of Selection
- The fundamental building blocks of life are proteins. Enzymes,
which are the molecular machines responsible for virtually all of
the chemical transformations that cells are capable of, are
proteins. In addition, much of the structure of a cell is made up
of proteins. That part of the structure which is not made up of
proteins is produced by enzymes, which are proteins. Mycoplasma
genitalium has the smallest known genome of any organism that is
not a virus. It codes for 468 proteins, which have been called the
minimal set for life. A recent estimate of the smallest number of
proteins required to make a cell is 256 (click here for more
details on how many proteins are essential to make a cell). E.
coli contains 4288 genes and a yeast cell contains about 6000
genes. A human contains on the order of 100,000 different
proteins. It is the properties of and the interactions between
these 100,000 proteins that make us what we are.
- Proteins are variable-length, linear, mixed polymers of 20
different amino acids. Other terms used more or less
interchangeably for amino-acid polymers are peptides and
polypeptides. These topologically linear polymers fold upon
themselves to generate a shape characteristic of each different
protein, and this shape along with the different chemical
properties of the 20 amino acids determines the function of the
protein. One of the most important concepts in modern biology is
that the functional properties of proteins are determined largely
by the sequence of the 20 amino acids in the linear polypeptide
chain, and that in many cases proteins are largely self-folding.
Thus, in theory, knowing the sequence of a protein (the order in
which the amino acids occur) one could infer its function.
- For protein-encoding genes, the object on which natural
selection acts is the protein itself and not the DNA. The
underlying DNA sequence reflects this process in combination with
species-specific pressures on DNA sequence (like the need for
thermophiles to have DNA that is resistant to melting or a very
high or low GC content). Thus if function demands that a protein
maintains a specific sequence, there still is sufficient room for
the DNA sequence to change.
Conclusion: where possible use a translated cDNA sequence for
your protein analysis!
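How much room the DNA has to change while the protein stays the same can
be counted from the universal code: the number of DNA sequences that
encode a given peptide is the product of the number of codons for each
of its residues. The sketch below is an illustration only; the name
`count_encodings` is an assumption of this example and the count ignores
codon bias, so it is an upper bound on what a given organism would
actually use.

```python
from collections import Counter

# Amino acids of the universal code, one letter per codon in TCAG order.
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
DEGENERACY = Counter(AMINO)  # number of codons per amino acid


def count_encodings(peptide):
    """Number of distinct DNA sequences that translate into `peptide`
    (one-letter code). Unknown letters give a count of zero."""
    total = 1
    for residue in peptide.upper():
        total *= DEGENERACY[residue]
    return total


print(count_encodings("MW"))      # 1
print(count_encodings("LSR"))     # 216
print(count_encodings("CDEGLD"))  # 384
```

Even a six-residue peptide can be written in hundreds of ways at the DNA
level, so identical proteins may hide considerable DNA divergence.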
RNA Editing
The DNA sequence does not always translate directly into the
amino-acid sequence. The pre-mRNA may require alteration of its
coding sequence before it can be translated into a functional
protein. This is called post-transcriptional editing. Several
different mechanisms of post-transcriptional editing are known.
These are:
- RNA editing in the Kinetoplastida.
This involves the insertion or deletion of one or more Us in the
pre-mRNA, using guide RNAs as templates. In this way initiation
codons or amino acids that are not encoded in the DNA are added,
or encoded amino acids are removed, during the editing process.
This can lead to major differences between the DNA and the mature
mRNA sequence. In some extreme cases (in Trypanosoma brucei)
more than 50% of a gene is edited; such genes are called
pan-edited genes. Their DNA and mature mRNA no longer hybridise
to each other. Nevertheless this leads to roughly the same protein
sequence after final editing. (Some details about editing in
Trypanosomatidae can be found here or on the RNA editing site of
Larry Simpson.)
- Post-transcriptional base modification in some gene
products.
Examples of these are:
- Modification of rRNAs
- Modification of tRNAs
- Modification of the apo-lipoprotein B mRNA creating an
additional termination codon
Conclusion: Peptide sequences are not always identical to what
is predicted by the corresponding genes!
Some good advice
It is recommended to analyse your data set both ways (DNA and
protein).
Keep in mind that:
- For a group of species or taxa that are relatively close in
time or that are closely related (like viral proteins or
vertebrate enzymes) DNA-based analysis is probably a good way to
go, since you avoid such problems as differences in codon bias or
saturation of the third position of codons. It is nevertheless
strongly recommended to carry out an analysis on the protein data
as well.
- Multigene families (for instance genes coding for different,
but similar, isoenzymes) may cause problems. Be careful when you
decide to exclude or include such sequences (this may result in
paralogous sequences in your data set and peculiar-looking
phylogenetic trees).
Construction of Phylogenetic Trees
Alignment of two protein sequences
In order to study molecular evolution of proteins one has to
compare the sequence of a protein with that of other homologous
proteins (click here for an explanation of
the term homologous and the difference between homology and
identity). However, such a comparison is not easy to make because the
sequences that need to be compared are usually not identical, but
only similar. In addition to having substitutions, there will be
insertions and deletions in one sequence relative to the other.
Phylogenetic analyses should be carried out on homologous residues
only (i.e. those residues in each of the two sequences that
originate from a common ancestral residue). Thus it is essential
that two or more sequences be properly aligned relative to each
other. Several excellent programs have been developed for the
alignment of multiple DNA or protein sequences. Some, like DNA*,
GeneWorks or MacVector, are commercial and expensive but run on the PC
or Macintosh sitting on your desktop. You may have access to a
computer center where various programs, both commercial and freeware,
have been installed, such as the Pileup program of the Wisconsin GCG
(Genetics Computer Group) package that is available through the
Belgian EMBL node computer (BEN),
to which you can connect for a reasonable annual fee. You might go
out on the network and download the source code for a program such as
ClustalW, which is available for many different platforms and which
you compile and run on your own computer. Or finally, there are now
places on the network that run programs such as ClustalW and Darwin
and that you can connect to and use. This latter possibility will be
covered in my Demo chapter.
The visual method
How do programs such as ClustalW, Pileup, Multalign, etc. work?
The easiest way to understand this is to have a look at a visual type
of alignment of two sequences such as is carried out by the
"Dot-matrix" method. In this method the two sequences to be aligned
are written out as column and row headings of a matrix. Dots are put
in the matrix wherever the residues in the two sequences are
identical. If the two sequences are identical there will be dots in
all the diagonal elements of the matrix. If the two sequences are
different, but can be aligned without gaps, there will be dots in
most of the diagonal elements. If a gap occurred in one of the
sequences, the alignment diagonal will be shifted vertically or
horizontally.
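The dot-matrix procedure just described is easy to reproduce. The sketch
below is a bare-bones illustration in which the function names and the
characters used for dots and empty cells are assumptions of this
example; real dot-plot programs usually compare short windows rather
than single residues to suppress background noise.

```python
def dot_matrix(row_seq, col_seq):
    """One string per residue of `row_seq`: 'x' where it is identical to
    the residue of `col_seq` heading that column, '.' elsewhere."""
    return ["".join("x" if r == c else "." for c in col_seq)
            for r in row_seq]


def print_dot_matrix(row_seq, col_seq):
    """Print the matrix with the two sequences as row and column headings."""
    print("  " + " ".join(col_seq))
    for residue, line in zip(row_seq, dot_matrix(row_seq, col_seq)):
        print(residue + " " + " ".join(line))


print_dot_matrix("CDEGLD", "CDEGLD")  # unbroken main diagonal
print()
print_dot_matrix("CDEGLD", "CDELD")   # diagonal shifts at the missing G
```

In the second plot the row for G is empty and the diagonal continues one
step lower, which is the vertical shift that signals a gap.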
The figure here shows a dot-matrix for two highly homologous
trypanosome phosphoglycerate kinases and for two more distantly
related phosphoglycerate kinase sequences from Trypanosoma and
Euglena. It is obvious that there is no problem in aligning two
sequences as long as they are of similar length and have more
than 50% identity.
Computer algorithms
Computer algorithms that have been developed for the alignment of
two homologous sequences use in principle the same procedure as the
dot-matrix method. Each residue of one sequence is compared with each
residue of the other and when there is an identity a certain value is
given to that position in the matrix.
|   | C | D | E | G | L | D |
| C | x |   |   |   |   |   |
| D |   | x |   |   |   | x |
| E |   |   | x |   |   |   |
| G |   |   |   | x |   |   |
| L |   |   |   |   | x |   |
| D |   | x |   |   |   | x |
Identity score for two identical sequences
Then a diagonal line is drawn connecting the points with the
highest score. Horizontal or vertical shifts from the diagonal due to
the presence of gaps are given a penalty.
|   | C | D | E | L | D |
| C | x |   |   |   |   |
| D |   | x |   |   | x |
| E |   |   | x |   |   |
| G |   |   |   |   |   |
| L |   |   |   | x |   |
| D |   | x |   |   | x |
Identity score for two sequences with one indel
The choice of the value for each positive score relative to that
of the gap penalty will of course strongly influence the quality of
the resulting alignment. Too low a gap penalty will lead to a
situation where dissimilar regions are not aligned with each other
at all, but with gaps, while too high a gap penalty will force the
alignment of non-homologous residues or regions. Most programs
allow the user to select an appropriate weight matrix for the
scoring of either identities or similarities of amino acids and
to adjust the gap penalty value.
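The interplay between identity scores and the gap penalty can be tried
out with a small global alignment routine in the style of Needleman and
Wunsch. This is a sketch under stated assumptions, not the code of any
of the programs named in this chapter: the scores (+1 for an identity,
-1 for a mismatch, -1 per gap position) are arbitrary choices for the
example, a plain identity matrix is used instead of PAM or BLOSUM, and
gaps are charged per position without a separate opening penalty.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of a and b; returns (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)

    def pair(i, j):
        # Identity scoring of residue i of a against residue j of b.
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the score matrix; row 0 and column 0 hold leading gaps.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + pair(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner along the best path.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), score[n][m]


print(needleman_wunsch("CDEGLD", "CDELD"))  # ('CDEGLD', 'CDE-LD', 4)
```

Rerunning the call with a different `gap` value changes the total score
of this example, and on longer, more divergent sequences it also changes
where the gaps are placed.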
Many different weight matrices have been developed for the use
with sequence alignment programs. Some of these are:
- Identity matrix
- Mutation-cost matrix
- Hydrophobicity matrix
- PAM1 matrix
- PAM250 matrix
- BLOSUM (Block Substitution) matrix
- Log odds matrix
More detailed information about these matrices can be obtained from
the document: "Weight Matrices for Similarity
Scoring" compiled by David Wheeler.
It is up to the scientific judgement of the user, and depending on
the dataset that is being analysed, what kind of weight matrix will
be chosen.
Several algorithms for the alignment of protein sequences have
been developed. Some of the better known are the Pearson-Lipman
algorithm, used in Pearson's well-known FASTA program, the
Needleman-Wunsch (check, ref), the Smith-Waterman (ref) and the BLAST
algorithms. They are used in sequence-comparison programs for the
search of homologous sequences in large sequence databases and in
programs used for the creation of multiple sequence alignments.
Multiple sequence alignment
In order to create a multiple alignment of homologous protein
sequences you have to collect all related sequences from a database.
Several databases are available for this purpose. First of all the
SwissProt database would be the
most reliable source for your sequences. The advantage of this
database is that each sequence is checked and extensively annotated.
Moreover, homologous sequences from different organisms have highly
similar locus names, which tremendously facilitates the recognition
and retrieval of related sequences. The Enzyme
or EC database is a useful tool to retrieve all homologous
sequences of one specific enzyme by FTP. Other databases for the
retrieval of related sequences are the homologous protein domain
databases Prodom
and the WIT
(What Is There (PUMA)) databases.
For the construction of reliable phylogenetic trees the quality of
a multiple alignment of the protein sequences is of the utmost
importance.
There are many programs available for the multiple alignment of
proteins. Click here for an extensive, up-to-date list of multiple
alignment software compiled by Georg Fuellen.
Good programs in the public domain are:
- on the World-Wide Web or available as freeware:
- Pileup of the Wisconsin GCG (Genetics Computer Group)
package
Click here for an example of an alignment
in Pileup MSF format.
They quickly align pairs of sequences and roughly determine the
degrees of identity between each pair. Then the sequences are aligned
more precisely in a progressive way, starting with the two most
related sequences.
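The first stage of such a progressive strategy, a quick all-against-all
comparison to find the two most related sequences, can be sketched as
follows. This is a stand-in for illustration only: it uses the
general-purpose `difflib.SequenceMatcher` from the Python standard
library as the rough similarity measure, which is not what ClustalW or
Pileup do internally, and the function names and the three toy sequences
are assumptions of this example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def rough_similarity(a, b):
    """Quick similarity between two unaligned sequences, between 0 and 1."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def most_related_pair(sequences):
    """Names of the two most similar sequences in a {name: sequence} dict."""
    return max(combinations(sorted(sequences), 2),
               key=lambda pair: rough_similarity(sequences[pair[0]],
                                                 sequences[pair[1]]))


toy = {"a": "CDEGLD", "b": "CDELD", "c": "WWWWWW"}
print(most_related_pair(toy))  # ('a', 'b')
```

A progressive aligner would align this closest pair first and then add
the remaining sequences in order of decreasing similarity.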
Darwin
is a large suite of programs that allows you to carry out many
different types of analysis on a collection of proteins, including
the construction of a phylogenetic tree.
ProteinPredict
is a program for the prediction of the secondary structure of a
protein. It makes use of the information on the mutability of each
residue in a multiple sequence alignment. Therefore it first aligns
the query sequence with all homologous sequences available in the
SwissProt database. So this is an easy way to create a multiple
sequence alignment that includes your query sequence.
The algorithm of the Pileup program of GCG resembles that of
Clustal and both programs give more or less the same result. The
Wisconsin GCG (Genetics Computer Group) package is a commercial
package for DNA and protein analysis. It can be accessed via the BEN
(Belgium EMBNet Node) computer,
provided that you or your laboratory have an account there.
NB: Most multiple sequence alignment programs work best
when the sequences have similar length.
For an extensive list of multiple alignment resources, both for
the creation and the analysis of alignments, compiled by Georg
Fuellen and available on the net: click
here.
Prodom and Blocks databases
There is another way to obtain multiple sequence alignments to
which your sequence can be aligned. There exist databases of
prealigned sequences that share domain structures or homologous
blocks of sequences. These databases are called Prodom, Pfam and
Blocks, and are accessible via the Internet. They have been
compiled by comparing and aligning all homologous sequences of the
SwissProt database of protein sequences. The difference between
the Prodom and Blocks databases is that in Prodom gaps are
allowed, whereas in the Blocks database, which has been compiled
using the Blast algorithm, no gaps are allowed. Therefore, in
general the compiled sequences in Blocks are shorter than in
Prodom. Prodom and Blocks alignments serve as an excellent basis
to start a multiple sequence alignment and/or phylogeny project.
Manual adjustment of a protein
alignment
An automatically produced multiple sequence alignment often needs
manual adjustment to improve the quality of the alignment. Such
improvement can be obtained by using all the knowledge that is
available about a protein.
Some rules of thumb:
- The rules for mutation of amino acids are dependent on their
physicochemical properties.
- Surface residues (DRENK) are preferably mutated to residues of
similar properties. Since they are polar or charged they are
mainly on the outside of the folded protein. Since they are not,
or less, involved in protein folding they mutate rather easily.
- Hydrophobic residues (FAMILYVW) are preferentially replaced by
other hydrophobic ones. These residues are mainly internal and
determine the folding of the protein. They thus mutate rather
slowly.
- The residues CHQST are generally indifferent and may be
replaced with any other type of residue.
- The residues (DRENKCHQST), when conserved throughout the
alignment, are very likely residues that are involved in the
active site. So the multiple alignment should be adjusted in such
a way as to keep them aligned.
- Periodicity of charged residues may provide information as to
the presence of elements of secondary structure such as
alpha-helices and beta-strands. Alpha helices have 3.6 residues
per turn. Stretches of more than 12 amino acids with a charged
amino acid every 3rd or 4th residue may be indicative of the
presence of an alpha helix. Repetition of charged amino acids
every 2nd residue may be indicative of a beta-strand structure.
- Indels (insertions/deletions) are almost never found in
elements of secondary structure but only in loops. Moreover, P and
G interfere with secondary structure elements and thus have a
preference for loops. Since loops easily acquire or lose residues
you should always try to align indels with P and G residues.
- Hydrophobicity (or hydropathy) profiles according to Kyte and
Doolittle of two homologous proteins are in general strikingly
similar (see figure below).
Hydrophobicity profile according to Kyte and Doolittle
Click here for a comparison of the hydropathy profiles of the
phosphoglycerate kinases from Trypanosoma congolense, T. brucei
and Euglena gracilis.
Click here to see a typical alignment of homologous PGK sequences.
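A hydropathy profile of the kind compared above is a moving average of
per-residue hydropathy values along the sequence. The sketch below uses
the Kyte and Doolittle scale; the function name, the default window of 7
residues and the short test peptide are assumptions of this example, and
no plotting is attempted.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=7):
    """Mean hydropathy of every window of `window` consecutive residues.

    Returns one value per window position, so the profile is
    len(protein) - window + 1 values long (empty if the protein is
    shorter than the window).
    """
    values = [KYTE_DOOLITTLE[aa] for aa in protein.upper()]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


profile = hydropathy_profile("IIIVVLLDDEEKKRR", window=5)
print([round(v, 2) for v in profile])
```

The values fall from strongly positive over the hydrophobic start of the
test peptide to strongly negative over its charged end. Profiles of two
homologous proteins computed this way can be laid over each other to
check, and if necessary adjust, the placement of gaps.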
Another useful tool for the manual alignment of proteins, in case
a crystal structure exists for at least one of the proteins in the
alignment, is the NRL-3D database. This database contains protein
sequences plus secondary structure information. It tells you which
residues of a protein sequence belong to conserved areas of
secondary structure such as alpha helices and beta strands. Such
areas almost never contain indels.
Continue with Protein Phylogeny
Last updated: 26 September 1997.
Created by: Fred Opperdoes