The Phytomonas MDH gene:

A molecular biology project carried out via the World-Wide Web

by Fred Opperdoes

ICP-TROP, Brussels, Belgium


Introduction

Antonio (Toni), a post-doc from Argentina in the TROP Unit, has purified both the glycosomal and mitochondrial malate dehydrogenase (MDH) enzymes from Phytomonas. As the last part of his project he wants to spend a few months on the cloning and sequencing of the gene coding for the glycosomal isoenzyme. While busy sequencing he has little time to analyse his data in any detail, so he decides to leave this to the end of his stay, because he knows that with the help of the World Wide Web he can do the analyses in a single afternoon just before leaving Brussels. So once he has a complete sequence he connects his computer to the Web and tries to get all the information required to confirm the identity of his sequence. He uses only software available on the Internet. All he needs is a modem, a telephone number that gives direct access to the Internet, a WWW client program such as Netscape, Explorer or Mosaic, and a word-processor that allows him to cut and paste text.




At this stage, if you are not online, connect to the World Wide Web by starting your Web browser (such as Netscape) and open this document from within your browser. Then open a second browser window to carry out the exercises, while you use the first window for reading the text.

Click here to see the open reading frame (ORF) Toni sequenced:

To find out whether or not the Phytomonas ORF has any homology to other nucleic acid sequences in the GenBank database, he performs a BLAST (Basic Local Alignment Search Tool) search using the WWW (World Wide Web) server of the NCBI (National Center for Biotechnology Information) at the NIH (National Institutes of Health) in Bethesda, USA.

(Click on the above Internet address to open a BLAST sequence submission window at NCBI. Then copy the entire sequence from this window into the BLAST sequence submission window by using Copy and Paste in the Edit menu. Use all standard settings. Select the complete GenBank database (nr). Submit the sequence to NCBI and wait for the result to arrive.)

Here is a sample output of the search, which is returned within a few minutes:

(Compare your own output with the one from August 1997 to find out whether, and how many, new sequences have been added to the database since that date.)

Since the BLAST search reports homology with many malate dehydrogenases, Toni has good hopes that he has cloned the correct gene. He now takes a look at one of the reported sequences from the database. (Click on one of the accession numbers, for instance gb|AF004202|AF004202 in the BLAST output and the following lines of the GenBank record should appear on the screen:)


LOCUS       AF004202      864 bp    DNA             BCT       01-AUG-1997
DEFINITION  Escherichia coli malate dehydrogenase (mdh) gene, partial cds.
ACCESSION   AF004202
NID         g2289314
KEYWORDS    .
SOURCE      Escherichia coli.
  ORGANISM  Escherichia coli
            Eubacteria; Proteobacteria; gamma subdivision; Enterobacteriaceae;
            Escherichia.
REFERENCE   1  (bases 1 to 864)
  AUTHORS   Pupo,G.M., Karaolis,D.K.R., Lan,R. and Reeves,P.R.
  TITLE     Evolutionary relationships among pathogenic and non-pathogenic
            Escherichia coli strains inferred from multilocus enzyme
            electrophoresis and mdh sequence studies
  JOURNAL   Infect. Immun. 65 (7) (1997) In press
REFERENCE   2  (bases 1 to 864)
  AUTHORS   Pupo,G.M., Karaolis,D.K.R., Lan,R. and Reeves,P.R.
  TITLE     Direct Submission
  JOURNAL   Submitted (15-MAY-1997) Microbiology, University of Sydney,
            Department of Microbiology, G08 Maze Cresent, Sydney, N.S.W. 2006,
            Australia
-----
(Text truncated here)


Since the above record provides little information beyond the mention of mdh or malate dehydrogenase, and since the paper mentioned in the record is still in press and thus not yet available in the Institute's library, Toni decides to read its abstract online. He knows that, in general, sequences are submitted to the database at the time a paper is submitted to a journal and that GenBank releases the sequence by the time the paper is published. Using the GenBank accession number (AF004202) as query, he now accesses the Entrez server at the NCBI to search for the paper in the molecular biology subset of Medline, the database of biomedical literature. (Carry out this search. Whether or not you get a positive result will depend on how much time has elapsed since the submission date of the above GenBank record.)


Here is the output from Entrez


From the information that Toni has obtained so far, he concludes that there are very strong indications that the Phytomonas ORF he got indeed codes for an MDH or MDH-related protein.

Toni now decides to translate the ORF into a protein sequence. The WWW server at the EBI (European Bioinformatics Institute, an outstation of the EMBL located at Hinxton Hall near Cambridge, UK) offers the PROTEIN MACHINE utility for the translation of nucleotide sequences. He connects to it and submits the sequence.

(Paste the nucleotide sequence into the sequence window of the Protein Machine and then submit it.)

After a few seconds the server returns the following :


Translation:

You are translating a DNA string of length=966

Which should yield a protein of 322 amino acids.

MAHVCVVGAAGGVGQALSLLLTRSLPYGSTLSLYDVVGAPGVAADLSHID
NAGVTVKFAAGKIGVTRDPALAALATGVDVFVFVAGGPIMPGMKRDDLFN
STAGIVLDLVMTCASSSPKAMFCMISNPVNSTVPIAAEVLKKLGVYNKNR
LLGVTRLDMLRATRFINEARMPLVVDRVPVVGGHSDNSIVPLFHQLQGPL
PPKEQLDKITLRVQSAAYEVIDAKGGRGSATLAMGEAGARFVLDVVKGLT
GASNPLVYAYVDTDGQSESEFLAIPVILGKSGIERRLPIGPMTESEKKLV
DVAISIVKKNIEKGKEFALSKL

The complete output from the Protein Machine is here
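
The translation step itself can be sketched offline in a few lines of Python. This is only a minimal illustration, assuming the standard genetic code and reading frame 1; it is not the Protein Machine's own code. The function name `translate` is made up for this example, and the 12-base test string is a hypothetical stretch of DNA chosen to encode the first four residues shown above (the real ORF is not reproduced here).

```python
# Standard genetic code, with codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna):
    """Translate a DNA string in frame 1; unknown codons become 'X'."""
    dna = "".join(dna.split()).upper().replace("U", "T")
    usable = len(dna) - len(dna) % 3  # ignore an incomplete trailing codon
    return "".join(
        CODON_TABLE.get(dna[pos:pos + 3], "X") for pos in range(0, usable, 3)
    )


# Hypothetical test string (NOT Toni's sequence): encodes M-A-H-V.
print(translate("ATGGCTCACGTG"))  # MAHV

# A 966-base ORF holds 966 / 3 = 322 codons, as reported by the server.
print(966 // 3)  # 322
```

Stop codons are returned as `*` so that an unexpected stop inside the ORF is easy to spot.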


To be sure that his translated protein really is a malate dehydrogenase he scans the protein sequence against the PROSITE database. This is a dictionary of protein patterns and motifs (such as active site signatures, potential phosphorylation and glycosylation sites, etc.) used for determining the functions of proteins. PROSITE is available from the University Medical Center in Geneva.

(Access the PROSITE server and do two things: scan the protein sequence against the database to see if it is recognised as an MDH, and retrieve the MDH record in the database to read more about the specific characteristics of this enzyme.)

Indeed his protein sequence is recognised as an MDH (click here to see the result). The scan recognizes the active site signature at residues 154-166 (VTRLDMLRATRFI), as well as a microbody targeting signal at residues 320-322 (SKL), confirming that he has cloned the correct glycosomal MDH and not a mitochondrial or cytosolic isoenzyme. In addition, potential glycosylation and phosphorylation sites are found, but since there is no indication that glycosomal MDH is modified post-translationally, this information is ignored.

Toni also collects the MDH record from the same database and learns that the active site signature is as follows: [LIVM]-T-[TRKMN]-L-D-x(2)-R-[STA]-x(3)-[LIVMFY] (D and R are the active site residues). This signature recognizes all MDHs available so far and Toni is pleased to see that even his glycosomal MDH conforms to it.
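
To see why the signature matches at residues 154-166, the pattern can be rewritten as a regular expression and run against the translated sequence. The sketch below is an illustration only: the helper `prosite_to_regex` is a made-up name, it handles only the simple PROSITE syntax used in this signature (bracketed alternatives, `x` wildcards, repeat counts and `{...}` exclusions), and it is not how the PROSITE server itself is implemented.

```python
import re

# Protein sequence as returned by the Protein Machine (322 residues).
PROTEIN = (
    "MAHVCVVGAAGGVGQALSLLLTRSLPYGSTLSLYDVVGAPGVAADLSHID"
    "NAGVTVKFAAGKIGVTRDPALAALATGVDVFVFVAGGPIMPGMKRDDLFN"
    "STAGIVLDLVMTCASSSPKAMFCMISNPVNSTVPIAAEVLKKLGVYNKNR"
    "LLGVTRLDMLRATRFINEARMPLVVDRVPVVGGHSDNSIVPLFHQLQGPL"
    "PPKEQLDKITLRVQSAAYEVIDAKGGRGSATLAMGEAGARFVLDVVKGLT"
    "GASNPLVYAYVDTDGQSESEFLAIPVILGKSGIERRLPIGPMTESEKKLV"
    "DVAISIVKKNIEKGKEFALSKL"
)

MDH_SIGNATURE = "[LIVM]-T-[TRKMN]-L-D-x(2)-R-[STA]-x(3)-[LIVMFY]"


def prosite_to_regex(pattern):
    """Convert a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # {P}: any but P
        element = element.replace("(", "{").replace(")", "}")   # x(2): repeat
        element = element.replace("x", ".")                     # x: any residue
        parts.append(element)
    return "".join(parts)


regex = prosite_to_regex(MDH_SIGNATURE)
print(regex)  # [LIVM]T[TRKMN]LD.{2}R[STA].{3}[LIVMFY]

match = re.search(regex, PROTEIN)
# Report 1-based, inclusive residue numbers as PROSITE does.
print(match.start() + 1, match.end(), match.group())  # 154 166 VTRLDMLRATRFI
```

The D at position 158 and the R at position 161 of the match are the two residues that the PROSITE record marks as active site residues.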


Toni remembers that a BLAST search of protein sequences has a much higher signal-to-noise ratio than a corresponding search of nucleic acids. So he decides to run a BLAST search with his Phytomonas MDH protein as query sequence against a protein database. This way he hopes to find all relevant malate dehydrogenase sequences available in the protein databank. Although he could have used the translated GenPept database of GenBank, he prefers to search the non-redundant SWISSPROT database. Although this is a secondary or derived database, it has been thoroughly checked by scientists for redundancy and correctness of all the included sequence information. Moreover, this database is extensively annotated and easy to search because of the similar locus names for related proteins. For his search he decides to use the BLAST server at the NCBI.

(Paste your sequence into the sequence submission window, select the BLASTP program and the SWISSPROT database, and start the search.)


The most relevant part of the BLAST output reads as follows:

Sequences producing High-scoring Segment Pairs:              Score  P(N)      N

sp|P04636|MDHM_RAT   MALATE DEHYDROGENASE, MITOCHONDRIAL ...   279  4.9e-86   5
sp|P00346|MDHM_PIG   MALATE DEHYDROGENASE, MITOCHONDRIAL       277  7.2e-86   5
sp|P08249|MDHM_MOUSE MALATE DEHYDROGENASE, MITOCHONDRIAL ...   275  9.2e-86   5
sp|P06994|MDH_ECOLI  MALATE DEHYDROGENASE                      249  1.5e-83   5
sp|P25077|MDH_SALTY  MALATE DEHYDROGENASE                      247  2.7e-83   5
sp|P46487|MDHM_EUCGU MALATE DEHYDROGENASE, MITOCHONDRIAL ...   255  3.2e-81   4
sp|P48364|MDH_VIBS5  MALATE DEHYDROGENASE                      236  7.3e-81   5
sp|P17783|MDHM_CITVU MALATE DEHYDROGENASE, MITOCHONDRIAL ...   249  1.3e-78   5
sp|P44427|MDH_HAEIN  MALATE DEHYDROGENASE                      236  2.0e-78   5
sp|P37226|MDH_PHOS9  MALATE DEHYDROGENASE                      227  9.4e-78   5
sp|P19446|MDHG_CITVU MALATE DEHYDROGENASE, GLYOXYSOMAL PR...   256  2.1e-77   5
sp|P17505|MDHM_YEAST MALATE DEHYDROGENASE, MITOCHONDRIAL ...   247  2.6e-76   4
sp|P46488|MDHG_CUCSA MALATE DEHYDROGENASE, GLYOXYSOMAL PR...   249  4.7e-76   5
sp|P37228|MDHG_SOYBN MALATE DEHYDROGENASE, GLYOXYSOMAL PR...   253  4.7e-76   5
sp|P32419|MDHP_YEAST MALATE DEHYDROGENASE, PEROXISOMAL         210  1.9e-65   5
sp|P22133|MDHC_YEAST MALATE DEHYDROGENASE, CYTOPLASMIC          94  1.1e-48   7
sp|P37227|MDHM_SCHMA MALATE DEHYDROGENASE, MITOCHONDRIAL       263  3.8e-38   2
sp|P49814|MDH_BACSU  MALATE DEHYDROGENASE                       88  4.1e-08   3


The complete BLAST output is available as well.

Since a BLAST search of protein sequences has a better signal-to-noise ratio than a similar search of nucleic acid sequences, he now gets a much cleaner result. Only homologous MDH sequences are reported, all with highly relevant scores. Interestingly, there are some peroxisomal and glyoxysomal sequences, together with mitochondrial and cytosolic ones. Many of the sequences that were found in the nucleic acid search against GenBank are not found this time, because partial sequences and multiple records for one and the same sequence are not included in the SwissProt database.
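
Because the SwissProt locus names encode the compartment (MDHM, MDHG, MDHP, MDHC), a pasted hit table can be tallied with a short script. The following is a rough sketch that assumes the column layout of the 1997 excerpt above (identifier, description, score, P(N), N); the regular expression and the names `HIT_LINE` and `parse_hits` are inventions for this example and will not fit the output of other BLAST versions.

```python
import re
from collections import Counter

# A few rows copied from the table above.
HITS = """\
sp|P04636|MDHM_RAT   MALATE DEHYDROGENASE, MITOCHONDRIAL ...   279  4.9e-86   5
sp|P00346|MDHM_PIG   MALATE DEHYDROGENASE, MITOCHONDRIAL       277  7.2e-86   5
sp|P06994|MDH_ECOLI  MALATE DEHYDROGENASE                      249  1.5e-83   5
sp|P19446|MDHG_CITVU MALATE DEHYDROGENASE, GLYOXYSOMAL PR...   256  2.1e-77   5
sp|P32419|MDHP_YEAST MALATE DEHYDROGENASE, PEROXISOMAL         210  1.9e-65   5
sp|P22133|MDHC_YEAST MALATE DEHYDROGENASE, CYTOPLASMIC          94  1.1e-48   7
"""

# Assumed layout: sp|ACCESSION|PROTEIN_SPECIES  description  score  P(N)  N
HIT_LINE = re.compile(
    r"^sp\|(?P<accession>[A-Z0-9]+)\|(?P<protein>[A-Z0-9]+)_(?P<species>[A-Z0-9]+)"
    r"\s+(?P<description>.*?)\s+(?P<score>\d+)\s+(?P<p_value>\S+)\s+(?P<n>\d+)\s*$"
)


def parse_hits(text):
    """Return one dict per hit line; lines that do not match are skipped."""
    hits = []
    for line in text.splitlines():
        match = HIT_LINE.match(line)
        if match is None:
            continue
        hit = match.groupdict()
        hit["score"] = int(hit["score"])
        hit["p_value"] = float(hit["p_value"])
        hit["n"] = int(hit["n"])
        hits.append(hit)
    return hits


hits = parse_hits(HITS)
by_protein = Counter(hit["protein"] for hit in hits)
for protein, count in sorted(by_protein.items()):
    print(protein, count)
```

The `accession` field of each parsed hit is the code needed later to retrieve the full sequence from SwissProt.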


Toni now decides to study the evolutionary relationship of his Phytomonas enzyme to the other malate dehydrogenases. For this purpose he needs to prepare a multiple alignment of the most relevant MDH protein sequences, followed by a distance matrix and a phylogenetic tree. To retrieve all homologous protein sequences he connects to the SwissProt database using the WWW protein server EXPASY at the University of Geneva in Switzerland.

(From the last BLAST table shown above, select each accession code in turn to retrieve the corresponding sequence, and paste it into a word-processor document so as to produce a single file with the sequences one after the other.)

Toni is now going to use his file with MDH sequences for the construction of a multiple alignment with the program ClustalW. However, before he can use it, the file has to be converted to the right format. (One of the problems of using freely available software is that each program and each database uses a different file format.) ClustalW recognizes several file formats, but the server that Toni is going to use accepts only Pearson/Fasta format. Thus he needs to reformat his file so that all the sequences are in Pearson/Fasta format. This can easily be done manually, because this format is the simplest there is. However, there are sequence editors such as SeqApp or SeqPup, and the sequence format conversion utility Readseq, that do this automatically. Toni decides to use the last option because it is the quickest. He connects to the Readseq server at the NIH and reformats his file to a Pearson/Fasta sequence file. He then adds his own Phytomonas sequence and another two trypanosome MDH sequences, which have become available via a colleague, to the file and uses this final Pearson/Fasta file as input for the construction of the multiple alignment.
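
For readers who want to see how simple the Pearson/Fasta format is: each record is a header line starting with `>` followed by the sequence, usually wrapped to a fixed width. The sketch below writes such a file from (name, sequence) pairs. It is not Readseq; the function name `to_fasta`, the 60-column width and the record name `Phytomonas_gMDH` are assumptions made for this example.

```python
def to_fasta(records, width=60):
    """Format (name, sequence) pairs as Pearson/Fasta text."""
    lines = []
    for name, sequence in records:
        # Drop spaces and line breaks left over from cutting and pasting.
        cleaned = "".join(sequence.split()).upper()
        lines.append(">" + name)
        for start in range(0, len(cleaned), width):
            lines.append(cleaned[start:start + width])
    return "\n".join(lines) + "\n"


# Hypothetical record name; the residues are the start of Toni's protein.
records = [("Phytomonas_gMDH", "MAHVCVVGAA GGVGQALSLL\nLTRSLPYGST")]
print(to_fasta(records, width=20), end="")
```

Keeping the header to a single short word without spaces avoids problems with alignment programs that truncate or split long names.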

He contacts the ClustalW server at the Baylor College in Houston and pastes the Pearson/Fasta file into the sequence window. Then he submits the data to the server using all default settings. Then he sits back and waits for the result to be sent back to him. This takes a few minutes. Here is part of the output that he received:


Click here to see the entire output of ClustalW


Encouraged by the nice alignment and the rapid progress of his project (he has been working on it for only 30 minutes now), Toni decides (albeit a bit prematurely) to prepare a nice alignment suitable for the publication he wants to write on this project. For this he has to convert the ClustalW alignment in Pearson/Fasta format, which was attached to the end of the ClustalW output, to MSF format. This is done with the Readseq server at the NIH, which, by the way, also allows the creation of Pretty Print alignments ready for publication. The newly created MSF file is now pasted into the BOXSHADE utility at the WWW server of the Swiss Institute for Experimental Cancer Research.

Here is an example of the output:



Toni recalls that the two MDH isoenzymes in Phytomonas have a different localization: one is present in the glycosomes and the other in the mitochondrion. So he thinks it may be worthwhile to search his protein sequence for the presence of protein-sorting signals. He sends his sequence to the PSORT server at GenomeNet in Japan, and indeed this server detects a C-terminal SKL sequence and predicts a potential peroxisomal location for his enzyme. The output of the PSORT server is shown here.
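
The C-terminal check can be illustrated with a very small script. The sketch below tests only for a type 1 peroxisomal targeting signal at the extreme C-terminus, using the frequently quoted simplified consensus (S/A/C)-(K/R/H)-(L/M); that consensus is an assumption of this example, the names `PTS1` and `has_pts1` are made up, and PSORT itself weighs many more features than this.

```python
import re

# Simplified PTS1 consensus (an assumption, not PSORT's own rule):
# S, A or C, then K, R or H, then L or M, at the very end of the protein.
PTS1 = re.compile(r"[SAC][KRH][LM]$")


def has_pts1(protein):
    """Return True if the protein ends in a PTS1-like tripeptide."""
    cleaned = "".join(protein.split()).upper()
    return PTS1.search(cleaned) is not None


# Last line of the translated Phytomonas sequence, ending in SKL.
print(has_pts1("DVAISIVKKNIEKGKEFALSKL"))  # True
# The same stretch with the last three residues removed.
print(has_pts1("DVAISIVKKNIEKGKEFAL"))  # False
```

A positive result from such a check is only a hint; an internal SKL, for instance, does not function as a targeting signal, which is why the pattern is anchored to the end of the sequence.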



Now that Toni has prepared a multiple alignment ready for publication, he also wants to prepare an evolutionary tree of his enzyme and the related MDHs. He decides to use the DARWIN (Data Analysis and Retrieval With annotated Nucleic acid and protein sequences) WWW server at the Eidgenössische Technische Hochschule in Zurich. He selects the AllAll utility of the DARWIN server and as input he gives his own protein sequence together with the two T. brucei sequences and the SWISSPROT accession codes for the other malate dehydrogenases that he wants to include (see example). As output options he selects the PAM data and the unrooted and rooted phylotrees, as well as the multiple alignment.

(Copy and paste the entire example file into the sequence submission window of the AllAll server of DARWIN. Type your email address, select the options you want attached to your output and then submit the data. Please note that this server accepts mixed input consisting of your own sequences and accession numbers and/or locus names from the SwissProt database. Also note that this server creates its own multiple alignment, from which it creates a phylogenetic tree.)

The resulting PAM-distance matrix, multiple alignment and rooted and unrooted phylogenetic trees are returned to him within a few minutes by email, in the form of plain text as well as a PostScript file. On his Macintosh he removes all non-relevant information from his email message and then prints the PostScript file. (You can either print the file to a PostScript printer via a utility such as the LaserWriter Utility for the Macintosh, or open the file from inside a PostScript reader, such as MacGS. The latter possibility allows you to save it as a PICT file that can be used in combination with graphics programs.) This is what the DARWIN output looks like:


Click here to see the DARWIN output


The Phytomonas gMDH (sequence a) is most closely related to the T. brucei gMDH (PAM distance = 37), but only distantly related to the T. brucei mitochondrial MDH (PAM distance = 87). This suggests that the two MDHs in trypanosomatids may have a different origin, which may indicate that at least one of the two entered the trypanosome by horizontal transfer. One of the problems of the MDH alignment is that the sequences have different lengths. In a pairwise distance comparison, extensions present in one pair but absent in another may bias the distance matrix. Therefore Toni now decides to remove all the indels and extensions in order to circumvent this potential pitfall.

He imports the original MSF alignment into MS Word and deletes all the indels and extensions (using the Alt-Shift keys in combination with the mouse button). This is how it looks now. He then converts the trimmed sequences to Pearson/Fasta format by sending the file to the Readseq server at the NIH. He selects Pearson/Fasta as output format and 'remove gaps' as option. In the resulting Pearson/Fasta file he moves the Phytomonas sequence and the T. brucei glycosomal and mitochondrial sequences to the top of the file, so that they will appear as sequences a, b and c in the DARWIN output. Then all lines with the species names are removed and the bare sequences, separated by commas, are submitted to the DARWIN AllAll server.
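
The manual column deletion in the word processor can also be done by script. The sketch below removes every alignment column in which at least one sequence has a gap, which is one straightforward reading of "remove all the indels and extensions"; it is not the procedure Toni used. The function name `drop_gapped_columns`, the set of gap characters and the toy three-sequence alignment are assumptions for this example.

```python
def drop_gapped_columns(alignment, gap_chars="-.~"):
    """Remove every column that contains a gap in at least one sequence.

    `alignment` maps sequence names to aligned strings of equal length.
    """
    names = list(alignment)
    rows = [alignment[name] for name in names]
    if len({len(row) for row in rows}) > 1:
        raise ValueError("aligned sequences must all have the same length")
    # zip(*rows) walks the alignment column by column.
    kept = [
        column for column in zip(*rows)
        if not any(residue in gap_chars for residue in column)
    ]
    return {
        name: "".join(column[index] for column in kept)
        for index, name in enumerate(names)
    }


# Toy alignment (not real MDH data): columns 3 and 6 contain gaps.
toy = {"a": "MA-HVC", "b": "MAKHV-", "c": "MSKHVC"}
print(drop_gapped_columns(toy))  # {'a': 'MAHV', 'b': 'MAHV', 'c': 'MSHV'}
```

After trimming, all sequences have the same length, so that terminal extensions present in only some of them can no longer inflate the pairwise distances.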

This is how this new DARWIN output should look:

However, Toni did not succeed in getting a reply within a reasonable time! (Try it out yourself; maybe you will have more luck now.)

So Toni decides to send his alignment file to the ClustalW server in Kyoto, Japan. This server not only aligns the sequences, as was done above, but also returns a neighbor-joining tree, from which he can draw his tree with a program like Treeview on the Mac (or Treetool on a Unix machine).

This is the output from the Kyoto server.
The treefile is at the bottom of the alignment. He cuts and pastes the treefile into a small Word document, saves it as a text-only document and imports this into the Treeview program.

This is the figure he drew himself with Treeview

It is obvious that the glycosomal, glyoxysomal and peroxisomal sequences are intermixed with the mitochondrial sequences. Apparently the microbody sequences arose several times in parallel through the transfer of a mitochondrial MDH to the microbodies. This is also supported by the deep branching of the tree, which suggests that most branching points represent gene duplication and relocation rather than speciation.

NB: Final conclusions can only be drawn after a thorough phylogenetic analysis including several different methods together with bootstrap analysis. However, Toni can already start writing his manuscript.


Last updated: 28 September 1997.

Created by: Fred Opperdoes