Introduction
Antonio (Toni), a post-doc from Argentina in the TROP Unit, has purified both the glycosomal and mitochondrial malate dehydrogenase (MDH) enzymes from Phytomonas. As the last part of his project he wants to spend a few months on the cloning and sequencing of the gene coding for the glycosomal isoenzyme. While busy sequencing he has little time to analyse his data in any detail, so he decides to leave this until the end of his stay: he knows that, with the help of the World Wide Web, he can do the analyses in a single afternoon just before leaving Brussels. Once he has a complete sequence, he connects his computer to the web and tries to gather all the information required to confirm the identity of his sequence. He uses only software available on the Internet. All he needs is a modem, a telephone number that gives direct access to the Internet, a WWW client program such as Netscape, Explorer or Mosaic, and a word-processor that allows him to cut and paste text.
Click here to see the open reading frame (ORF) Toni sequenced:
To find out whether the Phytomonas ORF shows any homology to other nucleic acid sequences in the GenBank database, he performs a BLAST (Basic Local Alignment Search Tool) search using the WWW (World Wide Web) server of the NCBI (National Center for Biotechnology Information) at the NIH (National Institutes of Health) in Bethesda, USA.
(Click on the above internet address to open a BLAST sequence submission window at NCBI. Then copy the entire sequence by using Copy and Paste in the Edit menu from this window into the Blast sequence submission window. Use all standard settings. Select the complete GenBank database (nr). Submit the sequence to NCBI and wait for the result to arrive)
Here is a sample output of the search, which is returned within a few minutes:
(Compare your own output with the one from August 1997 to find out whether and how many new sequences have been added to the database since that date)
Since the BLAST search reports homology with many malate dehydrogenases, Toni has good hopes that he has cloned the correct gene. He now takes a look at one of the reported sequences from the database. (Click on one of the accession numbers, for instance gb|AF004202|AF004202 in the BLAST output and the following lines of the GenBank record should appear on the screen:)
LOCUS       AF004202      864 bp    DNA             BCT       01-AUG-1997
DEFINITION  Escherichia coli malate dehydrogenase (mdh) gene, partial cds.
ACCESSION   AF004202
NID         g2289314
KEYWORDS    .
SOURCE      Escherichia coli.
  ORGANISM  Escherichia coli
            Eubacteria; Proteobacteria; gamma subdivision;
            Enterobacteriaceae; Escherichia.
REFERENCE   1  (bases 1 to 864)
  AUTHORS   Pupo,G.M., Karaolis,D.K.R., Lan,R. and Reeves,P.R.
  TITLE     Evolutionary relationships among pathogenic and non-pathogenic
            Escherichia coli strains inferred from multilocus enzyme
            electrophoresis and mdh sequence studies
  JOURNAL   Infect. Immun. 65 (7) (1997) In press
REFERENCE   2  (bases 1 to 864)
  AUTHORS   Pupo,G.M., Karaolis,D.K.R., Lan,R. and Reeves,P.R.
  TITLE     Direct Submission
  JOURNAL   Submitted (15-MAY-1997) Microbiology, University of Sydney,
            Department of Microbiology, G08 Maze Cresent, Sydney, N.S.W.
            2006, Australia
----- (Text truncated here)
Since the above record provides little information beyond the mention of mdh (malate dehydrogenase), and since the paper cited in the record is still in press and thus not yet available in the Institute's library, Toni decides to read its abstract on line. He knows that, in general, sequences are submitted to the database at the time a paper is submitted to a journal and that GenBank releases the sequence by the time the paper is published. Using the GenBank accession number (AF004202) as query, he now accesses the Entrez server at the NCBI to search for the paper in the molecular biology subset of Medline, the database of biomedical literature. (Carry out this search. Whether you get a positive result will depend on how much time has elapsed since the submission date of the above GenBank record.)
Here is the output from Entrez:
From the information that Toni has obtained so far, he concludes that there are very strong indications that the Phytomonas ORF he obtained indeed codes for an MDH or MDH-related protein.
Toni now decides to translate the ORF into a protein sequence. The WWW server at the EBI (European Bioinformatics Institute, an outstation of the EMBL located in Hinxton Hall near Cambridge, UK) offers the PROTEIN MACHINE utility for translating nucleotide sequences. Toni connects to it and submits the sequence.
(Paste the nucleotide sequence into the sequence window of the Protein Machine and submit it.)
After a few seconds the server returns the following:
Translation:
You are translating a DNA string of length=966
Which should yield a protein of 322 amino acids.
MAHVCVVGAAGGVGQALSLLLTRSLPYGSTLSLYDVVGAPGVAADLSHID NAGVTVKFAAGKIGVTRDPALAALATGVDVFVFVAGGPIMPGMKRDDLFN STAGIVLDLVMTCASSSPKAMFCMISNPVNSTVPIAAEVLKKLGVYNKNR LLGVTRLDMLRATRFINEARMPLVVDRVPVVGGHSDNSIVPLFHQLQGPL PPKEQLDKITLRVQSAAYEVIDAKGGRGSATLAMGEAGARFVLDVVKGLT GASNPLVYAYVDTDGQSESEFLAIPVILGKSGIERRLPIGPMTESEKKLV DVAISIVKKNIEKGKEFALSKL
The complete output from the ProteinMachine is here
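The numbers reported by the server (966 nucleotides, 322 amino acids) simply follow from reading the ORF three bases at a time. The Python sketch below illustrates this kind of translation. It assumes the standard genetic code, and it uses a short made-up DNA fragment rather than Toni's ORF. The function name `translate` is ours and has nothing to do with how the Protein Machine is implemented.

```python
from itertools import product

# Standard genetic code, listed in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna: str) -> str:
    """Translate a DNA string in frame 1, stopping at the first stop codon."""
    dna = dna.upper().replace("U", "T")
    protein = []
    # Ignore any incomplete codon at the 3' end.
    for i in range(0, len(dna) - len(dna) % 3, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # 'X' for ambiguous codons
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

# Made-up fragment (not Toni's sequence): six codons plus a stop codon.
example_dna = "ATGGCTCACGTTTGTGTTTAA"
print(translate(example_dna))   # MAHVCV
print(966 // 3)                 # 322 codons in a 966-nucleotide ORF
```

A real translation tool would also offer the other five reading frames and alternative genetic codes; this sketch covers only the simplest case.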
(Access the Prosite server and do two things: scan the protein sequence against the database to see whether it is recognised as an MDH, and retrieve the MDH record in the database to read more about the specific characteristics of this enzyme.)
Indeed his protein sequence is recognised as an MDH (click here to see the result). The scan recognizes the active site signature 154-166 VTRLDMLRATRFI, as well as a microbody targeting signal 320-322 SKL, confirming that he has cloned the correct glycosomal MDH and not a mitochondrial or cytosolic isoenzyme. In addition, potential glycosylation and phosphorylation sites are found, but since there is no indication that glycosomal MDH is post-translationally modified, this information is ignored.
Toni also collects the MDH record from the same database and learns that the active site signature is as follows: [LIVM]-T-[TRKMN]-L-D-x(2)-R-[STA]-x(3)-[LIVMFY] (D and R are the active site residues). This signature recognizes all MDHs available so far, and Toni is pleased to see that even his glycosomal MDH conforms to it.
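A Prosite signature of this simple kind maps directly onto a regular expression, so the match can be checked by hand. The following Python sketch converts the signature quoted above and scans the translated Phytomonas sequence with it. The helper `prosite_to_regex` is our own illustration: it handles only the syntax used in this particular signature (residue sets, x, and repeat counts), not the full Prosite pattern language.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Convert a simple PROSITE pattern (sets, x, repeat counts) to a regex."""
    parts = []
    for element in pattern.split("-"):
        element = element.replace("x", ".")            # x = any residue
        # Repeat counts: x(2) -> .{2}, x(2,4) -> .{2,4}
        element = re.sub(r"\((\d+(?:,\d+)?)\)", r"{\1}", element)
        parts.append(element)
    return "".join(parts)

# Translated Phytomonas ORF, copied from the Protein Machine output above.
PHYTOMONAS_MDH = (
    "MAHVCVVGAAGGVGQALSLLLTRSLPYGSTLSLYDVVGAPGVAADLSHID"
    "NAGVTVKFAAGKIGVTRDPALAALATGVDVFVFVAGGPIMPGMKRDDLFN"
    "STAGIVLDLVMTCASSSPKAMFCMISNPVNSTVPIAAEVLKKLGVYNKNR"
    "LLGVTRLDMLRATRFINEARMPLVVDRVPVVGGHSDNSIVPLFHQLQGPL"
    "PPKEQLDKITLRVQSAAYEVIDAKGGRGSATLAMGEAGARFVLDVVKGLT"
    "GASNPLVYAYVDTDGQSESEFLAIPVILGKSGIERRLPIGPMTESEKKLV"
    "DVAISIVKKNIEKGKEFALSKL"
)

MDH_PATTERN = "[LIVM]-T-[TRKMN]-L-D-x(2)-R-[STA]-x(3)-[LIVMFY]"
MDH_REGEX = prosite_to_regex(MDH_PATTERN)

hit = re.search(MDH_REGEX, PHYTOMONAS_MDH)
# Report 1-based, inclusive coordinates, as Prosite does.
print(hit.group(), hit.start() + 1, hit.end())   # VTRLDMLRATRFI 154 166
```

The coordinates agree with the 154-166 hit reported by the server.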
Toni remembers that a BLAST search with protein sequences has a much higher signal-to-noise ratio than a corresponding search with nucleic acids. So he decides to run a BLAST search with his Phytomonas MDH protein as query sequence against a protein database. This way he hopes to find all relevant malate dehydrogenase sequences available in the protein databank. Although he could have used GenPept, the translated version of GenBank, he prefers to search the non-redundant SWISSPROT database. Although this is a secondary or derived database, it has been thoroughly checked by scientists for redundancy and for the correctness of the included sequence information. Moreover, this database is extensively annotated and easy to search because related proteins have similar locus names. For his search he decides to use the BLAST server at the NCBI.
(Paste your sequence into the sequence submission window, select the program BLASTP and the SWISSPROT database, and start the search.)
Sequences producing High-scoring Segment Pairs:              Score  P(N)     N

sp|P04636|MDHM_RAT   MALATE DEHYDROGENASE, MITOCHONDRIAL ...   279  4.9e-86  5
sp|P00346|MDHM_PIG   MALATE DEHYDROGENASE, MITOCHONDRIAL       277  7.2e-86  5
sp|P08249|MDHM_MOUSE MALATE DEHYDROGENASE, MITOCHONDRIAL ...   275  9.2e-86  5
sp|P06994|MDH_ECOLI  MALATE DEHYDROGENASE                      249  1.5e-83  5
sp|P25077|MDH_SALTY  MALATE DEHYDROGENASE                      247  2.7e-83  5
sp|P46487|MDHM_EUCGU MALATE DEHYDROGENASE, MITOCHONDRIAL ...   255  3.2e-81  4
sp|P48364|MDH_VIBS5  MALATE DEHYDROGENASE                      236  7.3e-81  5
sp|P17783|MDHM_CITVU MALATE DEHYDROGENASE, MITOCHONDRIAL ...   249  1.3e-78  5
sp|P44427|MDH_HAEIN  MALATE DEHYDROGENASE                      236  2.0e-78  5
sp|P37226|MDH_PHOS9  MALATE DEHYDROGENASE                      227  9.4e-78  5
sp|P19446|MDHG_CITVU MALATE DEHYDROGENASE, GLYOXYSOMAL PR...   256  2.1e-77  5
sp|P17505|MDHM_YEAST MALATE DEHYDROGENASE, MITOCHONDRIAL ...   247  2.6e-76  4
sp|P46488|MDHG_CUCSA MALATE DEHYDROGENASE, GLYOXYSOMAL PR...   249  4.7e-76  5
sp|P37228|MDHG_SOYBN MALATE DEHYDROGENASE, GLYOXYSOMAL PR...   253  4.7e-76  5
sp|P32419|MDHP_YEAST MALATE DEHYDROGENASE, PEROXISOMAL         210  1.9e-65  5
sp|P22133|MDHC_YEAST MALATE DEHYDROGENASE, CYTOPLASMIC          94  1.1e-48  7
sp|P37227|MDHM_SCHMA MALATE DEHYDROGENASE, MITOCHONDRIAL       263  3.8e-38  2
sp|P49814|MDH_BACSU  MALATE DEHYDROGENASE                       88  4.1e-08  3
As expected from the better signal-to-noise ratio of protein searches, he now gets a much cleaner result. Only homologous MDH sequences are reported, all with highly significant scores. Interestingly, there are some peroxisomal and glyoxysomal sequences together with mitochondrial and cytosolic ones. Many of the sequences found in the nucleic acid search against GenBank are not found this time, because partial sequences and multiple records for one and the same sequence are not included in the SwissProt database.
Toni now decides to study the evolutionary relationship of his Phytomonas enzyme to the other malate dehydrogenases. For this purpose he needs to prepare a multiple alignment of the most relevant MDH protein sequences, followed by a distance matrix and a phylogenetic tree. To retrieve all homologous protein sequences he connects to the SwissProt database using the WWW protein server ExPASy at the University of Geneva in Switzerland.
(For each entry in the last BLAST table shown above, select its accession code to retrieve the corresponding sequence, and paste it into a word-processor document, so that you end up with a single file containing all the sequences one after the other.)
Toni is now going to use his file of MDH sequences to construct a multiple alignment with the program ClustalW. However, before he can do so the file has to be converted to the right format. (One of the problems with freely available software is that each program and each database uses a different file format.) ClustalW recognizes several file formats, but the server that Toni is going to use accepts only Pearson/Fasta format, so he needs a file with all the sequences in that format. This can easily be done manually, because Pearson/Fasta is the simplest format there is. However, there are sequence editors such as SeqApp or SeqPup, and the sequence format conversion utility Readseq, that do this automatically. Toni decides to use the last option because it is the quickest. He connects to the Readseq server at the NIH and reformats his file to a Pearson/Fasta sequence file. He then adds his own Phytomonas sequence, and another two trypanosome MDH sequences that have become available via a colleague, to the file and uses this final Pearson/Fasta file as input for the multiple alignment.
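Pearson/Fasta really is simple: each record consists of a header line starting with '>' followed by one or more lines of sequence. The Python sketch below writes and reads this format. It illustrates the format only, not how Readseq works internally; the function names and the record names in the example are ours.

```python
def to_fasta(records, width=60):
    """Format (name, sequence) pairs as Pearson/Fasta text."""
    lines = []
    for name, seq in records:
        seq = "".join(seq.split())              # drop spaces and line breaks
        lines.append(">" + name)
        # Wrap the sequence into lines of at most `width` residues.
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"

def parse_fasta(text):
    """Read Pearson/Fasta text back into a list of (name, sequence) pairs."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line[1:], ""])      # start a new record
        elif records:
            records[-1][1] += line              # continue the current one
    return [(name, seq) for name, seq in records]

# Hypothetical record names; the residues are the first 20 of the
# Phytomonas protein and the first 10 residues repeated, for illustration.
example = [
    ("PHYTO_gMDH", "MAHVCVVGAA GGVGQALSLL"),
    ("TOY_SEQ", "MAHVCVVGAA"),
]
fasta_text = to_fasta(example, width=10)
print(fasta_text)
```

Reading the text back with `parse_fasta(fasta_text)` returns the same names with the spaces removed from the sequences, which is a quick check that a hand-edited file is still well formed.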
He connects to the ClustalW server at the Baylor College of Medicine in Houston and pastes the Pearson/Fasta file into the sequence window. He submits the data using all default settings, then sits back and waits for the result, which takes a few minutes. Here is part of the output that he received:
Click here to see the entire output of ClustalW
Encouraged by the nice alignment and the rapid progress of his project (he has been working on it for only 30 minutes now), Toni decides, albeit a bit prematurely, to prepare an alignment suitable for the publication he wants to write. To do so he has to convert the alignment in Pearson/Fasta format, which was attached to the end of the ClustalW output, to MSF format. This is done with the Readseq server at the NIH, which incidentally also allows the creation of Pretty Print alignments ready for publication. The newly created MSF file is then pasted into the BOXSHADE utility at the WWW server of the Swiss Institute for Experimental Cancer Research.
Here is an example of the output:

Toni recalls that the two MDH isoenzymes in Phytomonas have a different localization: one is present in the glycosomes and the other in the mitochondrion. So he thinks it may be worthwhile to search his protein sequence for the presence of protein-sorting signals. He sends his sequence to the PSORT server at GenomeNet in Japan, and indeed this server detects a C-terminal SKL sequence and predicts a potential peroxisomal location for his enzyme. The output of the PSORT server is shown here.
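The C-terminal check can be illustrated in a few lines of Python. This is only a sketch: it assumes the commonly cited consensus (S/A/C)-(K/R/H)-(L/M) for the last three residues of a peroxisomal targeting signal, which is far simpler than the set of rules PSORT actually applies. The function name `has_c_terminal_pts1` is ours.

```python
import re

# Assumed consensus for a C-terminal microbody targeting signal:
# [S/A/C]-[K/R/H]-[L/M] as the very last three residues.
PTS1_REGEX = re.compile(r"[SAC][KRH][LM]$")

def has_c_terminal_pts1(protein: str) -> bool:
    """Return True if the protein ends in a tripeptide matching the consensus."""
    return bool(PTS1_REGEX.search(protein.strip().upper()))

# Last 22 residues of the translated Phytomonas ORF (residues 301-322).
c_terminus = "DVAISIVKKNIEKGKEFALSKL"
print(has_c_terminal_pts1(c_terminus))   # True
```

A positive result here only says that the tripeptide is compatible with microbody targeting; it does not by itself prove the localization.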
Now that Toni has prepared a multiple alignment ready for publication, he also wants to prepare an evolutionary tree of his enzyme and the related MDHs. He decides to use the DARWIN (Data Analysis and Retrieval With Indexed Nucleotide/peptide sequences) WWW server at the Eidgenössische Technische Hochschule in Zurich. He selects the AllAll utility of the DARWIN server, and as input he gives his own protein sequence together with the two T. brucei sequences and the SWISSPROT accession codes of the other malate dehydrogenases that he wants to include (see example). As output options he selects the PAM data and the unrooted and rooted phylotrees, as well as the multiple alignment.
(Copy and paste the entire example file into the sequence submission window of the AllAll server of Darwin. Type your email address, select the options you want for your output and then submit the data. Please note that this server accepts a mixed input of your own sequences and accession numbers and/or locus names from the SwissProt database. Also note that this server creates its own multiple alignment, from which it creates a phylogenetic tree.)
The resulting PAM-distance matrix, multiple alignment and rooted and unrooted phylogenetic trees are returned to him within a few minutes by email, as plain text as well as a PostScript file. On his Macintosh he removes all non-relevant information from the email message and then prints the PostScript file. (You can either print the file to a PostScript printer via a utility such as the LaserWriter Utility for the Macintosh, or open the file in a PostScript reader such as MacGS. The latter option allows you to save it as a PICT file that can be used in graphics programs.) This is what the DARWIN output looks like:
The Phytomonas gMDH (sequence a) is most closely related to the T. brucei gMDH (PAM distance = 37), but only distantly related to the T. brucei mitochondrial MDH (PAM distance = 87). This suggests that the two MDHs in trypanosomatids may have a different origin, which may indicate that at least one of the two entered the trypanosome by horizontal transfer. One of the problems of the MDH alignment is that the sequences have different lengths. In a pairwise distance comparison, extensions present in one pair but absent in another may bias the distance matrix. Toni therefore decides to remove all the indels and extensions in order to circumvent this potential pitfall.
He imports the original MSF alignment into MS Word and deletes all the indels and extensions (using the Alt-Shift keys in combination with the mouse button). This is how it looks now. He then converts the trimmed alignment to Pearson/Fasta format by sending the file to the Readseq server at the NIH, selecting Pearson/Fasta as output format and the option 'remove gaps'. In the resulting Pearson/Fasta file he moves the Phytomonas sequence and the two T. brucei sequences (glycosomal and mitochondrial) to the top of the file, so that they will appear as sequences a, b and c in the Darwin output. Finally, all lines with the species names are removed and the bare sequences, separated by commas, are submitted to the DARWIN AllAll server.
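Deleting indels and extensions by hand amounts to dropping every alignment column in which at least one sequence has a gap. The Python sketch below does this on a made-up three-sequence alignment. It is our own illustration of the trimming step, not a Readseq or Darwin feature, and it assumes that gaps are written as '-', '.' or '~'.

```python
def strip_gap_columns(alignment, gap_chars="-.~"):
    """Remove every column that contains a gap in at least one sequence.

    `alignment` maps sequence names to aligned sequences of equal length.
    """
    names = list(alignment)
    rows = [alignment[name] for name in names]
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    # zip(*rows) walks through the alignment column by column.
    kept = [col for col in zip(*rows) if not any(c in gap_chars for c in col)]
    return {
        name: "".join(col[i] for col in kept)
        for i, name in enumerate(names)
    }

# Made-up toy alignment (not the real MDH alignment).
toy_alignment = {
    "a": "MAH-VCV",
    "b": "MSHKVC-",
    "c": "-AHKVCV",
}
trimmed = strip_gap_columns(toy_alignment)
print(trimmed)   # {'a': 'AHVC', 'b': 'SHVC', 'c': 'AHVC'}
```

After trimming, all sequences have the same length, so N- or C-terminal extensions can no longer bias the pairwise distances.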
This is how this new DARWIN output should look:
However, Toni did not manage to get a reply within a reasonable time! (Try it out yourself; maybe you will have more luck.)
So Toni decides to send his alignment file to the ClustalW server in Kyoto in Japan. This server does not only align the sequences, as was done above, but it also returns a neighbor-joining tree from which he can create his tree with a program like Treeview on the Mac (or Treetool on a Unix machine).
This is the output from the Kyoto server.
The treefile is at the bottom of the alignment. He cuts and pastes the treefile into a small Word document that he saves as a text-only document and imports into the Treeview program.
This is the figure he drew himself with Treeview:
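ClustalW writes its tree in the parenthesised Newick (PHYLIP) format, which is plain text and can be inspected before it is imported into Treeview. The Python sketch below lists the leaf names in such a treefile, which is a quick way to check that no sequence was lost during cutting and pasting. It uses a made-up three-leaf tree rather than Toni's treefile, and it assumes unquoted leaf names; the function name is ours.

```python
import re

def newick_leaf_names(newick: str):
    """Return the leaf names of a Newick tree, in order of appearance.

    A leaf name is the label that directly follows '(' or ','.
    Labels after ')' (internal nodes, bootstrap values) are skipped.
    """
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)

# Made-up tree with branch lengths, wrapped over several lines in the
# way treefiles often are.
toy_tree = """(
(
seq_a:0.10,
seq_b:0.20)
:0.05,
seq_c:0.30);"""
print(newick_leaf_names(toy_tree))   # ['seq_a', 'seq_b', 'seq_c']
```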
It is obvious that the glycosomal, glyoxysomal and peroxisomal sequences are intermixed with the mitochondrial sequences. Apparently the microbody sequences arose several times in parallel from the transfer of a mitochondrial MDH to the microbodies. This is also supported by the deep branching of the tree, which suggests that most branching points represent gene duplication and relocation rather than speciation.
NB: Final conclusions can only be drawn after a thorough phylogenetic analysis including several different methods together with bootstrap analysis. However, Toni can already start writing his manuscript.