Arguments in favour of a phylogenetic analysis of the corresponding protein rather than the DNA

Codon bias

 

Back to the Table of Contents


 

Long Time Horizon


Conclusion: due to wobble-base degeneracy, the third base in the codons of a protein-coding gene is of little value in the analysis of distantly related proteins!
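To make the wobble argument concrete, the following sketch counts, for each codon position, which fraction of single-base changes in a sense codon leaves the encoded amino acid unchanged. It assumes the standard genetic code (other codes, such as mitochondrial ones, differ slightly), and the function and variable names are chosen here for illustration only.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code; codons are enumerated in TCAG order
# (first base varies slowest, third base varies fastest).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def synonymous_fraction_by_position():
    """Return, for codon positions 1-3, the fraction of single-base
    changes in sense codons that do not change the amino acid."""
    synonymous = [0, 0, 0]
    total = [0, 0, 0]
    for codon, aa in CODON_TABLE.items():
        if aa == "*":  # stop codons are not considered as starting points
            continue
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                total[pos] += 1
                if CODON_TABLE[mutant] == aa:
                    synonymous[pos] += 1
    return [s / t for s, t in zip(synonymous, total)]


if __name__ == "__main__":
    fractions = synonymous_fraction_by_position()
    for pos, frac in enumerate(fractions, start=1):
        print(f"codon position {pos}: {frac:.1%} of changes are silent")
```

Under this code most third-position changes are silent, a few first-position changes are, and no second-position change is, which is why third positions saturate quickly and carry little information over long time spans.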



Advantages of the translation of DNA into protein


Conclusion: the signal-to-noise ratio is greatly improved when protein sequences are used instead of DNA sequences!



Nature of Sequence Divergence in Proteins (the PAM)


Examples of some PAM values and their corresponding observed distances are given in the following table.
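Why the observed difference lags behind the PAM distance can be illustrated with a deliberately simplified model. The sketch below assumes that all 20 amino acids are equally frequent and equally interchangeable, so that repeated substitutions at the same site partly hide each other. This is not Dayhoff's empirical model, and its numbers will therefore differ from published PAM tables; the function name is hypothetical.

```python
import math


def expected_observed_difference(pam, n_states=20):
    """Expected fraction of differing sites after `pam` accepted point
    mutations per 100 residues, under an equal-rates model (assumption)."""
    d = pam / 100.0                  # substitutions per site
    k = (n_states - 1) / n_states    # maximum difference at saturation
    return k * (1.0 - math.exp(-d / k))


if __name__ == "__main__":
    for pam in (1, 10, 50, 100, 250):
        p = expected_observed_difference(pam)
        print(f"PAM {pam:3d}: about {100 * p:.0f}% observed difference")
```

The point of the sketch is qualitative: at small distances the observed difference is nearly equal to the PAM value, whereas at large distances it flattens out because many sites have been hit more than once.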

 



Proteins Evolve at Highly Different Rates.


Conclusion: proteins are excellent tools for studying the evolutionary relationships of both closely and distantly related taxa!



Introns and Non-Coding DNA



Multigene Families



Protein is the Unit of Selection


Conclusion: where possible use a translated cDNA sequence for your protein analysis!



 

RNA Editing


The DNA sequence does not always translate directly into the amino-acid sequence. The pre-mRNA may require alteration of its coding sequence before it can be translated into a functional protein. This is called post-transcriptional editing. Several different mechanisms of post-transcriptional editing are known. These are:


Conclusion: Peptide sequences are not always identical to what is predicted by the corresponding genes!



Some good advice


It is recommended to analyse your data set both ways (DNA and protein).

Keep in mind that:



Construction of Phylogenetic Trees

Alignment of two protein sequences

In order to study the molecular evolution of proteins, one has to compare the sequence of a protein with those of other homologous proteins (click here for an explanation of the term homologous and the difference between homology and identity). Such a comparison is not easy to make, because the sequences to be compared are usually not identical but only similar. In addition to substitutions, there will be insertions and deletions in one sequence relative to the other. Phylogenetic analyses should be carried out on homologous residues only (i.e. those residues in each of the sequences that originate from a common ancestral residue). It is therefore essential that two or more sequences be properly aligned relative to each other.

Several excellent programs have been developed for the alignment of multiple DNA or protein sequences. Some, like DNA*, GeneWorks or MacVector, are commercial and expensive, but run on the PC or Macintosh on your desktop. You may have access to a computer center where various programs, both commercial and freeware, have been installed, such as the Pileup program of the Wisconsin GCG (Genetics Computer Group) package, which is available through the Belgian EMBnet Node (BEN) computer, to which you can connect for a reasonable annual fee. You can also download from the network the source code of a program such as ClustalW, which is available for many different platforms and which you compile and run on your own computer. Finally, there are now sites on the network that run programs such as ClustalW and Darwin and that you can connect to and use. This last possibility is covered in my Demo chapter.



The visual method

How do programs such as ClustalW, Pileup, Multalign, etc. work? The easiest way to understand this is to look at a visual alignment of two sequences, as carried out by the "dot-matrix" method. In this method the two sequences to be aligned are written out as the column and row headings of a matrix. Dots are put in the matrix wherever the residues in the two sequences are identical. If the two sequences are identical, there will be dots in all the diagonal elements of the matrix. If the two sequences are different but can be aligned without gaps, there will be dots in most of the diagonal elements. If a gap has occurred in one of the sequences, the alignment diagonal will be shifted vertically or horizontally.

The figure here shows a dot-matrix for two closely related trypanosome phosphoglycerate kinases and for two more distantly related phosphoglycerate kinase sequences from Trypanosoma and Euglena. It is obvious that there is no problem in aligning two sequences as long as they are of similar length and have more than 50% identity.
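The dot-matrix idea is simple enough to sketch in a few lines. The example below builds the matrix for two short peptides and prints it with "x" for identical residues and "." for all other cells; the sequences and function names are illustrative choices, not part of any of the programs mentioned above.

```python
def dot_matrix(seq_rows, seq_cols):
    """Boolean matrix: True where the row residue equals the column residue."""
    return [[r == c for c in seq_cols] for r in seq_rows]


def render(seq_rows, seq_cols):
    """Return the dot matrix as text, one line per residue of seq_rows."""
    lines = ["  " + " ".join(seq_cols)]
    for residue, row in zip(seq_rows, dot_matrix(seq_rows, seq_cols)):
        cells = " ".join("x" if hit else "." for hit in row)
        lines.append(residue + " " + cells)
    return "\n".join(lines)


if __name__ == "__main__":
    # Identical sequences: an unbroken main diagonal.
    print(render("CDEGLD", "CDEGLD"))
    print()
    # One residue missing in the second sequence: the diagonal shifts.
    print(render("CDEGLD", "CDELD"))
```

Real dot-matrix programs usually add a sliding window and a threshold to suppress the background of chance matches, which this sketch leaves out.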



Computer algorithms

Computer algorithms that have been developed for the alignment of two homologous sequences use, in principle, the same procedure as the dot-matrix method. Each residue of one sequence is compared with each residue of the other, and wherever there is an identity a certain value is assigned to that position in the matrix.

 

    C  D  E  G  L  D
 C  x
 D     x           x
 E        x
 G           x
 L              x
 D     x           x

Identity score for two identical sequences

Then a diagonal line is drawn connecting the points with the highest score. Horizontal or vertical shifts from the diagonal due to the presence of gaps are given a penalty.

 

    C  D  E  L  D
 C  x
 D     x        x
 E        x
 G
 L           x
 D     x        x

Identity score for two sequences with one indel

The choice of the value for each positive score relative to that of the gap penalty will of course strongly influence the quality of the resulting alignment. Too high a gap penalty will lead to a situation where gaps are avoided altogether and non-homologous residues or regions are aligned with each other, while too low a gap penalty will lead to dissimilar regions not being aligned with each other at all, but with gaps. Most programs allow the user to select an appropriate weight matrix for the scoring of either identities or similarities of amino acids, and to adjust the gap penalty value.
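The interplay between match score and gap penalty can be tried out with a small global-alignment routine. The sketch below follows the general dynamic-programming scheme of Needleman-Wunsch-type algorithms with simple identity scoring and a linear gap penalty; the parameter values and the test sequences are arbitrary choices for illustration, and real programs use weight matrices and more refined gap models.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment with identity scoring and a linear gap penalty.
    Returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align residues
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Trace back from the bottom-right corner to recover one optimal path.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    for gap in (-1, -5):
        s, x, y = global_align("CDEGLD", "CDELDK", gap=gap)
        print(f"gap penalty {gap}: score {s}")
        print(" ", x)
        print(" ", y)
```

With the mild penalty the routine opens two gaps to bring the shared residues into register, whereas with the severe penalty it prefers an ungapped alignment that pairs non-identical residues.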

Many different weight matrices have been developed for use with sequence alignment programs. Some of these are:


More detailed information about these matrices can be obtained from the document: "Weight Matrices for Similarity Scoring" compiled by David Wheeler.

Which weight matrix to choose is a matter of the user's scientific judgement and depends on the dataset that is being analysed.
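The difference between scoring identities and scoring similarities can be shown with a toy example. The grouping of amino acids used below is a hypothetical, hand-made classification chosen only for illustration; it is not one of the published weight matrices, which assign a separate empirically derived value to every pair of residues.

```python
# Hypothetical grouping of the 20 amino acids by rough chemical character.
GROUPS = ["AVLIM", "FWY", "ST", "NQ", "DE", "KRH", "C", "G", "P"]
CLASS_OF = {aa: index for index, group in enumerate(GROUPS) for aa in group}


def identity_score(a, b):
    """1 for identical residues, 0 otherwise."""
    return 1 if a == b else 0


def similarity_score(a, b):
    """2 for identical residues, 1 for residues of the same group, else 0."""
    if a == b:
        return 2
    if CLASS_OF[a] == CLASS_OF[b]:
        return 1
    return 0


def score_ungapped(seq1, seq2, scorer):
    """Sum the per-position scores of two equally long, ungapped sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(scorer(a, b) for a, b in zip(seq1, seq2))


if __name__ == "__main__":
    print(score_ungapped("LDK", "IER", identity_score))
    print(score_ungapped("LDK", "IER", similarity_score))
```

The two peptides share no identical residue, so an identity matrix sees no relationship at all, while a similarity matrix still rewards the conservative replacements.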

Several algorithms for the alignment of protein sequences have been developed. Some of the better known are the Pearson-Lipman algorithm, used in Pearson's well-known FASTA program, the Needleman-Wunsch (check, ref), the Smith-Waterman (ref) and the BLAST algorithms. They are used in sequence-comparison programs that search for homologous sequences in large sequence databases and in programs that create multiple sequence alignments.



Multiple sequence alignment

In order to create a multiple alignment of homologous protein sequences you have to collect all related sequences from a database. Several databases are available for this purpose. First of all the SwissProt database would be the most reliable source for your sequences. The advantage of this database is that each sequence is checked and extensively annotated. Moreover, homologous sequences from different organisms have highly similar locus names, which tremendously facilitates the recognition and retrieval of related sequences. The Enzyme or EC database is a useful tool to retrieve all homologous sequences of one specific enzyme by FTP. Other databases for the retrieval of related sequences are the homologous protein domain databases Prodom and the WIT (What Is There (PUMA)) databases.


For the construction of reliable phylogenetic trees the quality of a multiple alignment of the protein sequences is of the utmost importance.

There are many programs available for the multiple alignment of proteins. Click here for an extensive, up-to-date list of multiple alignment software compiled by Georg Fuellen.

Good programs in the public domain are:

They quickly align pairs of sequences and roughly determine the degrees of identity between each pair. Then the sequences are aligned more precisely in a progressive way, starting with the two most related sequences.
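The first step of such a progressive strategy, ranking all pairs by their degree of identity, can be sketched as follows. To keep the example short it assumes that the sequences are already of equal length and compares them position by position; real programs first compute a quick pairwise alignment for every pair and then build a guide tree. The sequence names and residues are made up for illustration.

```python
from itertools import combinations


def percent_identity(seq1, seq2):
    """Percentage of identical positions in two equally long sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    matches = sum(a == b for a, b in zip(seq1, seq2))
    return 100.0 * matches / len(seq1)


def pairs_by_identity(sequences):
    """Return (identity, name1, name2) for every pair, most similar first."""
    scored = [
        (percent_identity(sequences[n1], sequences[n2]), n1, n2)
        for n1, n2 in combinations(sorted(sequences), 2)
    ]
    return sorted(scored, key=lambda item: -item[0])


if __name__ == "__main__":
    toy = {"seqA": "CDEGLD", "seqB": "CDEGLE", "seqC": "CNEALE"}
    for identity, n1, n2 in pairs_by_identity(toy):
        print(f"{n1} - {n2}: {identity:.1f}% identity")
```

The pair at the top of this list would be aligned first, and the remaining sequences would then be added to that alignment in order of decreasing relatedness.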

Darwin is a large suite of programs that allows you to carry out many different types of analysis on a collection of proteins, including the construction of a phylogenetic tree.

ProteinPredict is a program for the prediction of the secondary structure of a protein. It makes use of information on the mutability of each residue in a multiple sequence alignment, and therefore first aligns the query sequence with all homologous sequences available in the SwissProt database. This makes it an easy way to create a multiple sequence alignment that includes your query sequence.

The algorithm of the Pileup program of GCG resembles that of Clustal, and both programs give more or less the same result. The Wisconsin GCG (Genetics Computer Group) package is a commercial package for DNA and protein analysis. It can be accessed via the BEN (Belgian EMBnet Node) computer, provided that you or your laboratory have an account there.

NB: Most multiple sequence alignment programs work best when the sequences have similar length.

For an extensive list of multiple alignment resources, both for the creation and the analysis of alignments, compiled by Georg Fuellen and available on the net: click here.




Prodom and Blocks databases

There is another way to obtain multiple sequence alignments to which your sequence can be aligned. There exist databases of prealigned sequences that share domain structures or homologous blocks of sequence. These databases are called Prodom, Pfam and Blocks, and are accessible via the Internet. They have been compiled by comparing and aligning all homologous sequences of the SwissProt database of protein sequences. The difference between the Prodom and Blocks databases is that gaps are allowed in Prodom, whereas in the Blocks database, which has been compiled using the Blast algorithm, no gaps are allowed. Therefore, the compiled sequences in Blocks are in general shorter than those in Prodom. Prodom and Blocks alignments serve as an excellent basis for starting a multiple sequence alignment and/or phylogeny project.
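Why ungapped blocks are shorter than a gapped alignment of the same sequences can be seen by cutting an alignment at every column that contains a gap. The sketch below does just that; it is only an illustration of the gapped versus ungapped distinction, not the procedure by which the Blocks database is actually compiled, and the function name and toy alignment are assumptions.

```python
def ungapped_blocks(alignment, min_length=1, gap="-"):
    """Return (start, end) column ranges, end exclusive, in which none of
    the aligned sequences contains a gap character."""
    if not alignment:
        return []
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    blocks = []
    start = None
    for col in range(width):
        has_gap = any(row[col] == gap for row in alignment)
        if not has_gap and start is None:
            start = col                       # a new block begins here
        elif has_gap and start is not None:
            if col - start >= min_length:     # close the current block
                blocks.append((start, col))
            start = None
    if start is not None and width - start >= min_length:
        blocks.append((start, width))
    return blocks


if __name__ == "__main__":
    toy_alignment = ["CDEGLD", "CDE-LD"]
    for start, end in ungapped_blocks(toy_alignment):
        print(start, end, [row[start:end] for row in toy_alignment])
```

A single gapped alignment of six columns is split here into two shorter gap-free pieces.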



Manual adjustment of a protein alignment

An automatically produced multiple sequence alignment often needs manual adjustment to improve the quality of the alignment. Such improvement can be obtained by using all the knowledge that is available about a protein.

Some rules of thumb:

Hydrophobicity profile

Hydrophobicity profile according to Kyte and Doolittle

Click here for a comparison of the hydropathy profiles of the phosphoglycerate kinases from Trypanosoma congolense, T. brucei and Euglena gracilis.
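A hydropathy profile of this kind is a sliding-window average of per-residue hydropathy values. The sketch below uses the Kyte and Doolittle scale; the window length of 9 is an assumption made here for the example, and the function name is hypothetical.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=9):
    """Mean hydropathy of every window of `window` consecutive residues.
    Returns len(sequence) - window + 1 values."""
    if window < 1 or window > len(sequence):
        raise ValueError("window must be between 1 and the sequence length")
    values = [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


if __name__ == "__main__":
    # Made-up peptide: a hydrophobic stretch followed by a charged one.
    for i, value in enumerate(hydropathy_profile("LLVVIIAADDEEKKRR", window=5)):
        print(i, round(value, 2))
```

Plotting such profiles for two aligned homologues, one above the other, makes it easy to spot regions where the automatic alignment has shifted a hydrophobic segment out of register.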

Click here to see a typical alignment of homologous PGK sequences

Another useful tool for the manual alignment of proteins, when a crystal structure exists for at least one of the proteins in the alignment, is the NRL-3D database. This database contains protein sequences together with secondary-structure information. It tells you which residues of a protein sequence belong to conserved elements of secondary structure such as alpha helices and beta strands. Such regions almost never contain indels.



Continue with Protein Phylogeny


Last updated: 26 September 1997.

Created by: Fred Opperdoes