The enigma of the Leishmanial GPD gene

A demonstration of a molecular biology project carried out entirely via the World-Wide Web

by Fred Opperdoes, ICP-TROP, Brussels, Belgium



Tomas, a visiting parasitologist from the Charles University in Prague, worked in our laboratory from January to June 1995. He spent those 6 months sequencing a DNA fragment containing two open reading frames (ORFs), one of which he believed to code for the Leishmania mexicana NAD-dependent glycerol-3-phosphate dehydrogenase (GPD, EC 1.1.1.8), an enzyme that in Leishmania is associated with glycosomes as well as with mitochondria. While busy sequencing, he had no time to analyse his data in any detail. Once he had returned to Prague he had plenty of time, but no access to the sequence analysis software that was available in Brussels. So he decided to connect his computer to the World-Wide Web (WWW) and to analyse his sequence with the software available on the Internet. All he needed was a modem, a telephone line giving access to the Internet, a WWW client program such as NETSCAPE and a word processing program that allowed him to cut and paste text.

NB: Explanation of text colours:

  • Red: instructions to be followed by the student
  • Violet: results obtained by Tomas


At this stage, if you are not on line, connect to the World Wide Web by opening your Web browser (such as Netscape) and open this document from within your browser. Then open a second browser window to carry out the exercises, while you use the first window for reading the text.


In this project the following analyses are being carried out:


Before he starts his project, Tomas likes to collect some general information about the enzyme he intends to clone and sequence. So he decides to check out the Enzyme or EC database, which holds general information about enzymes: their official names and EC numbers, the reactions they catalyse, the pathways they are involved in and all available protein sequences. Once in the Enzyme database he enters the EC number 1.1.1.8 and gets the requested information. He checks out all the information available in the database and decides to retrieve all the GPD sequences in the SwissProt database for future use.


He also wants to know whether a 3-dimensional crystal structure is already available for a homologue of his enzyme. If available, this information is stored in the Brookhaven Protein Data Bank or PDB. He searches the database, but no glycerol-3-phosphate dehydrogenase structure is available yet.

Now he can start his project.

This is the ORF he sequenced.


To find out whether the Leishmania ORF has any homology to other nucleic acid sequences in the GenBank database he performs a BLAST (Basic Local Alignment Search Tool) search using the server of the NCBI (National Center for Biotechnology Information) at the NIH (National Institutes of Health) in Bethesda, USA.

(Click on the above internet address to open a BLAST sequence submission window at NCBI. Then copy the entire sequence by using Copy and Paste in the Edit menu from this window into the Blast sequence submission window. Use all standard settings. Select the complete GenBank database (nr). Submit the sequence to NCBI and wait for the result to arrive)

Here is the output of the search, which is returned within a few minutes:


Although the BLAST search reports homology with some glycerol-3-phosphate dehydrogenases (more GPD sequences may be available by the time you do this search), it also reports a Pseudomonas poly(3-hydroxybutyrate) depolymerase A precursor gene. Tomas is a bit worried about the identity of his ORF and decides to retrieve the entire Pseudomonas record from the database for a closer visual inspection. He clicks on its accession number (U12977) and the following lines of the GenBank record appear on the screen:

(Click on one of the accession numbers, for instance (U12977) in your BLAST output and the following lines of the GenBank record should appear on the screen:)


LOCUS       PLU12977     3704 bp    DNA             BCT       03-MAY-1995
DEFINITION  Pseudomonas lemoignei poly(3-hydroxybutyrate) depolymerase A
            precursor (phaZ5) gene, complete cds, and
            glycerol-3-phosphate-dehydrogenase homolog, complete cds.
ACCESSION   U12977
KEYWORDS    .
SOURCE      Pseudomonas lemoignei.
  ORGANISM  Pseudomonas lemoignei
            Eubacteria; Proteobacteria; unclassified pseudomonads.
REFERENCE   1  (bases 1 to 3704)
  AUTHORS   Jendrossek,D., Frisse,A., Behrends,A., Andermann,M.,
            Kratzin,H.D., Stanislawski,T. and Schlegel,H.G.
  TITLE     Biochemical and molecular characterization of the Pseudomonas
            lemoignei polyhydroxyalkanoate depolymerase system
  JOURNAL   J. Bacteriol. 177 (3), 596-607 (1995)
  MEDLINE   95138018
REFERENCE   2  (bases 1 to 3704)
  AUTHORS   Jendrossek,D.
  TITLE     Direct Submission
  JOURNAL   Submitted (04-AUG-1994) Dieter Jendrossek, Institut fuer
            Mikrobiologie, Georg-August-Universitaet Goettingen,
            Grisebachstrasse 8, Goettingen, Goettingen 37077, Germany
COMMENT     NCBI gi: 531465
FEATURES             Location/Qualifiers
     source          1..3704
                     /clone_lib="subclone plasmid pSN792"
                     /organism="Pseudomonas lemoignei"
     CDS             197..1138
                     /note="similar to glycerol-3-phosphate-dehydrogenase,
                     GenBank Accession Number U00039, and Swiss-Prot
                     Accession Number P13706; NCBI gi: 531467"
                     /codon_start=1
                     /transl_table=11
                     /translation="MTSCCGAVAKTRWRNVPDPCENTAYLPGHPLPAALKATADFSLA
                     LDHVAQGDGLLIAATSVAGLRPLAQQLQGKAIPNLVWLCKGLEEGSGLLPHQVVREVL
                     GTQLPAGVLSGPSFAQEVAQGLPCALVIAAEDAALRELVVAAVHGPAIRVYSSDDVVG
                     VEVGGAVKNILAIATGILDGMSLGLNARAALITRGLAEITRLGIALGARAETFMGLAG
                     VGDLILTCTGDLSRNRKVGLGLAQGKPLETIVTELGHVAEGVRCAAAVRNLAQQLQIE
                     MPITNAVAGILFDGHSPRATVEQLLARHPRDESISAS"
----- (Text truncated here)

 

Although the first relevant line in the BLAST report mentions "Pseudomonas lemoignei poly(3-hydroxybutyrate) depolymerase A", this GenBank record contains the coding sequences of two genes, one of which is indeed a GPD homologue.


Because Tomas does not have the relevant publication mentioned in the GenBank record (J. Bacteriol. 177 (3), 596-607 (1995)) at hand and because he is too impatient to leave his desktop computer, he decides to read the abstract from the paper that describes the two Pseudomonas sequences on line. So he clicks on the MEDLINE identifier (95138018) in the record and reads the abstract that now appears on the screen.

From the information he has obtained so far, Tomas concludes that there are very strong indications that the Leishmania ORF codes for a GPD, but he is intrigued by the fact that for his trypanosomatid GPD the highest degree of identity is reported with bacterial sequences.


Now Tomas wants to scan his DNA sequence for the presence of open reading frames. So he submits his nucleotide sequence to the NCBI open reading frame scanner to find all possible coding regions in all 6 reading frames.

(Submit the sequence by pasting the nucleotide sequence into the sequence window of the ORF scanner and then submit it).

The longest open reading frame is found starting at position 1 of his sequence. Only at the very end there is a stop codon.
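For readers who want to see what such a scan involves, here is a minimal Python sketch. It is not the NCBI scanner, whose exact rules (for instance a minimum length or a required ATG start) are not described in this text. It simply reports every stop-free stretch in the three forward and the three reverse-complement frames. All function names are made up for this illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def open_frames(dna):
    """Yield (strand, frame, start, end) for each stop-free stretch.

    start and end are 0-based offsets on the strand that was read;
    end is exclusive and the stop codon itself is not included.
    """
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = frame
            for pos in range(frame, len(seq) - 2, 3):
                if seq[pos:pos + 3] in STOP_CODONS:
                    if pos > start:
                        yield strand, frame, start, pos
                    start = pos + 3
            # Stretch that runs to the last complete codon without a stop.
            last = frame + ((len(seq) - frame) // 3) * 3
            if last > start:
                yield strand, frame, start, last


def longest_open_frame(dna):
    """Return the longest stop-free stretch found in the six frames."""
    return max(open_frames(dna), key=lambda orf: orf[3] - orf[2])
```

For a sequence like the one Tomas submitted, in which the first forward frame stays open until a stop codon at the very end, the result would start with ("+", 0, 0, ...). Keep in mind that a stop-free stretch is only a candidate: short sequences often show open frames on both strands purely by chance.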


To improve the sensitivity of his search Tomas now decides to translate the longest ORF into protein. For the translation he uses the Protein Machine utility at the EBI, which translates nucleotide sequences into protein, and he submits his nucleotide sequence to it.

(Submit the sequence by pasting the nucleotide sequence into the sequence window of the Protein Machine and then submit it).

After a few seconds the server returns the following output:


Translation:

You are translating a DNA string of length=1125.

Which should yield a protein of 375 amino acids.

 

He notes that there is a stop codon just before the end of the sequence that he has submitted. So he removes the last 8 amino acids from the output and saves the protein sequence on his clipboard for later use.
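The arithmetic behind the server output is simple: 1125 nucleotides divided by 3 gives 375 codons. The following is a small Python sketch of such a translation, assuming the standard genetic code. It is not the Protein Machine itself, and the function names are invented for this example.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, frame=0):
    """Translate a DNA string codon by codon; '*' marks a stop codon.

    Codons with ambiguous bases are shown as 'X'. A trailing incomplete
    codon is ignored.
    """
    dna = dna.upper().replace("U", "T")
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def trim_at_stop(protein):
    """Keep only the residues in front of the first stop codon."""
    return protein.split("*", 1)[0]
```

Calling trim_at_stop(translate(sequence)) does in one step what Tomas did by hand when he removed the residues that follow the stop codon.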


In order to be absolutely sure that his translated protein really is a glycerol-3-phosphate dehydrogenase he submits the sequence to the PROSITE database. This is a dictionary of protein patterns and motifs (such as active site signatures, potential phosphorylation and glycosylation sites, etc.). This database is used for assigning functions to unknown protein sequences. PROSITE is available from the University Medical Center in Geneva.

(You access the Prosite server and then do two things: (i) scan the protein sequence against the database to see if it is recognised as a GPD and (ii) retrieve the corresponding GPD record in the database to read more about the specific characteristics of this enzyme).

It turns out that his sequence is not recognized by the PROSITE database. The entry for glycerol-3-phosphate dehydrogenase mentions the following consensus sequence ([DN]-[LIV]3-F-X-[LIV]-P-H-Q-F). The corresponding peptide in the Leishmania GPD reads EIILFVIPTQF. The peptide resembles the pattern, but differs from it at 2 positions: it starts with E where the pattern asks for D or N, and it has T where the pattern asks for H. Tomas notes, however, that the consensus pattern in PROSITE is based on a limited number of yeast and animal sequences, while his sequence belongs to a protist. So he does not worry too much and continues his project.
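The comparison of the peptide with the pattern can also be done position by position in a few lines of Python. This is only a sketch: it assumes that "[LIV]3" in the pattern above means three consecutive residues from the set L, I, V (written "[LIV](3)" below), it handles just the pattern elements used here, and the function names are hypothetical.

```python
import re

# The GPD consensus quoted above, with the repeat written as (3).
GPD_PATTERN = "[DN]-[LIV](3)-F-x-[LIV]-P-H-Q-F"


def pattern_positions(pattern):
    """Expand a simple PROSITE-style pattern into one regex per position."""
    positions = []
    for element in pattern.split("-"):
        match = re.fullmatch(r"(\[[A-Z]+\]|[A-Za-z])(?:\((\d+)\))?", element)
        if match is None:
            raise ValueError("unsupported pattern element: " + element)
        residues = match.group(1)
        repeat = int(match.group(2) or 1)
        regex = "." if residues in ("x", "X") else residues
        positions.extend([regex] * repeat)
    return positions


def mismatches(pattern, peptide):
    """Return (position, residue) pairs, 1-based, where the peptide deviates."""
    positions = pattern_positions(pattern)
    if len(positions) != len(peptide):
        raise ValueError("peptide and pattern differ in length")
    return [
        (index + 1, residue)
        for index, (regex, residue) in enumerate(zip(positions, peptide))
        if not re.fullmatch(regex, residue)
    ]
```

For the Leishmania peptide, mismatches(GPD_PATTERN, "EIILFVIPTQF") lists the two deviating positions, which is why a strict pattern scan does not recognise the sequence.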


Because a BLAST search of protein sequences has a much higher signal to noise ratio than a corresponding search of nucleic acids, he decides to run a BLAST search with the Leishmania protein as query sequence. This way he hopes to find many more glycerol-3-phosphate dehydrogenase sequences in the protein databank. Although he could use the translated GenPept database of GenBank, he prefers to search the non-redundant SWISSPROT database. Although this is a derived database, it has been thoroughly checked by scientists for redundancy and correctness of the included sequence information. Moreover, this database is extensively annotated.

For his search he decides to use the BLAST server at the NCBI.

(Paste your sequence into the sequence submission window and then select the options "BLASTP" , the "SWISSPROT" database and the PAM 250 matrix and start the search).

The Dayhoff PAM 250 matrix is more appropriate for the detection of distantly related proteins.

The BLAST output is returned within a few seconds.

Because of the higher signal to noise ratio of a protein search, many more homologous sequences are now reported. In fact, 15 glycerol-3-phosphate dehydrogenase sequences have been found in the SWISSPROT database. The Pseudomonas sequence does not show up in the search: remember that SWISSPROT is a derived database, and the Pseudomonas sequence had not yet been included in the latest release (release 31) available at the time of this search (23 September 1995). The Pseudomonas nucleotide sequence only appeared in GenBank on May 3rd, 1995.

If interested you may try to find out why the Pseudomonas sequence was never included in the SwissProt database.

There are many genome sequencing projects going on that produce large amounts of expressed sequence tags or ESTs. The information is becoming so vast that EST sequence information is stored in separate subfiles of the GenBank database, and these subfiles are not searched unless specifically requested. Tomas knows that there are three genome sequencing initiatives focusing on the Trypanosomatidae that have generated thousands of ESTs, and he wants to see whether an EST for his enzyme is already available from a trypanosomatid. So he decides to scan his protein sequence against this EST database. He connects again to the BLAST server at the NCBI and now uses the TBLASTN program, which compares a protein sequence with a nucleotide database translated in all six reading frames, and he selects the dbEST subfile. Indeed, the output shows several trypanosome ESTs with a significant degree of identity to his own protein sequence. Since there are only some 1000 trypanosome ESTs available, this suggests that the mRNA coding for this protein in T. rhodesiense must be a relatively abundant one. The absence of a Leishmania sequence may suggest two things: either its mRNA is less abundant in this organism, or there are fewer Leishmania ESTs available. Both are probably true.

You may want to collect all these ESTs and align them with each other to create a consensus sequence, to get an impression of the reliability of single pass cDNA sequencing in these genome projects.


Tomas recalls that the enzyme GPD in Leishmania has a double location. It is present in the glycosomes and in the mitochondria as well. So he thinks it may be worthwhile to search his protein sequence for the presence of protein sorting signals. Thus he sends his sequence to the PSORT utility of the Nakai server at GenomeNet in Japan, and indeed this server reports a potential peroxisomal location for his enzyme. Since glycosomes and peroxisomes belong to the same family of organelles he is very pleased with this result. However, PSORT reports no mitochondrial transit sequence. The output of the PSORT server is shown here:



Since glycosomal proteins in general have a high net positive charge at neutral pH and pIs that range from 8 to 11, he decides to calculate a theoretical pI for his protein. He submits his sequence to the pI tool of the ExPASy server, and the server calculates a pI of 8.97 and a molecular mass of 39271.89 for his protein.
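The principle of such a calculation can be sketched in Python: sum the charges of all ionisable groups at a given pH (Henderson-Hasselbalch), then search for the pH at which the net charge is zero. The pKa values below are one commonly quoted textbook-style set and are an assumption of this sketch. The ExPASy tool uses its own values, so this code is not expected to reproduce the 8.97 reported above. Function names are invented for the example.

```python
# Assumed pKa values (illustrative only; other tables differ slightly).
POSITIVE_PKA = {"K": 10.8, "R": 12.5, "H": 6.5}
NEGATIVE_PKA = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
N_TERMINUS_PKA = 8.6
C_TERMINUS_PKA = 3.6


def net_charge(protein, ph):
    """Net charge of a protein at a given pH (Henderson-Hasselbalch)."""
    charge = 1 / (1 + 10 ** (ph - N_TERMINUS_PKA))
    charge -= 1 / (1 + 10 ** (C_TERMINUS_PKA - ph))
    for residue in protein.upper():
        if residue in POSITIVE_PKA:
            charge += 1 / (1 + 10 ** (ph - POSITIVE_PKA[residue]))
        elif residue in NEGATIVE_PKA:
            charge -= 1 / (1 + 10 ** (NEGATIVE_PKA[residue] - ph))
    return charge


def isoelectric_point(protein, tolerance=0.001):
    """Find the pH of zero net charge by bisection between pH 0 and 14."""
    low, high = 0.0, 14.0
    while high - low > tolerance:
        middle = (low + high) / 2
        if net_charge(protein, middle) > 0:
            low = middle   # still positive: the pI lies at higher pH
        else:
            high = middle
    return round((low + high) / 2, 2)
```

With this sketch, net_charge(sequence, 7.0) also gives a direct check of the statement that glycosomal proteins carry a net positive charge at neutral pH.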


For the completeness of his analyses he decides to determine the amino-acid composition of his enzyme and some statistical output as well. For this he uses the SBDS server in the UK. This server gives him the following result. More elaborate statistical analyses on the nature of his protein are made by the SAPS utility of the ISREC server in Lausanne. This is the output.
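An amino-acid composition is nothing more than a count of each residue, usually given together with its percentage. A minimal Python sketch (the function name is hypothetical, and the servers mentioned above report far more statistics than this):

```python
from collections import Counter


def composition(protein):
    """Return {residue: (count, percentage)} for a protein sequence."""
    counts = Counter(protein.upper())
    total = sum(counts.values())
    return {
        residue: (count, 100 * count / total)
        for residue, count in sorted(counts.items())
    }
```

For example, composition("AAGK") reports two alanines, making up 50% of the residues.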


He remembers that the Crithidia GPD easily binds to an agarose column (Bacchi et al, 1974), that the T. brucei GPD elutes as one of the last enzymes from a hydrophobic interaction chromatography column (Misset et al, 1987), that the enzyme has been suggested to be membrane associated by McLaughlin (1984) and that Pearson and coworkers (1995) accidentally cloned the T. brucei homologue because they thought it was a membrane enzyme. Therefore, he wants to see the hydrophobicity profile of the enzyme, and so he connects to the ProtScale utility on the ExPASy server in Switzerland. This server allows the creation of many different types of profile. He submits his protein sequence and receives a Kyte & Doolittle profile. This profile does not give him any clear clue as to the alleged hydrophobic properties of the enzyme.

(Try other profiles on this server as well).
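A Kyte & Doolittle profile is a sliding-window average of per-residue hydropathy values. The Python sketch below uses the published Kyte & Doolittle scale. The window of 9 residues is an assumption of this example, and the function name is made up. Plotting the resulting list against the window position gives the kind of curve the server draws.

```python
# Kyte & Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=9):
    """Average hydropathy for every window of the given size.

    Element i of the result covers residues i .. i + window - 1.
    Raises KeyError for residues outside the 20 standard amino acids.
    """
    scores = [KYTE_DOOLITTLE[residue] for residue in protein.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]
```

Long stretches with a high average (Kyte and Doolittle suggested a window of about 19 residues for spotting membrane-spanning segments) would point to a membrane protein. A profile without such stretches, as Tomas obtained, gives no such indication.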


Puzzled by the fact that in both BLAST searches (in SwissProt as well as in GenBank) bacterial sequences score much better than the eukaryotic GPD sequences, even though the query is the GPD sequence of the eukaryotic protist Leishmania, Tomas now decides to study the evolutionary relationships of the glycerol-3-phosphate dehydrogenases. For this purpose he needs to prepare a multiple alignment of all available protein sequences, followed by a corrected distance matrix and a phylogenetic tree.

First he sends his sequence to the PredictProtein server at the EMBL. This server compares his sequence with all sequences in the SwissProt database and prepares a multiple alignment of all homologous sequences in MSF format. However, PredictProtein is restrictive and gives him only 6 proteins homologous to his, rather than the 15 that were found with the BLAST program. This is also an indication that GPD evolves rapidly and that GPDs will have low percentages of pair-wise identity.
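Pair-wise identity, and the step from identity to a corrected distance, can be illustrated with a short Python sketch. It takes two rows of an existing alignment (gaps written as "-") and applies the simple Poisson correction d = -ln(1 - p), where p is the fraction of differing positions. This is a stand-in chosen for brevity: it is not the PAM distance computed by DARWIN, nor the correction used by ClustalW, and the function names are hypothetical.

```python
import math


def p_distance(row_a, row_b):
    """Fraction of differing residues between two aligned sequences.

    Columns in which either sequence has a gap ('-') are skipped.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(row_a, row_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def percent_identity(row_a, row_b):
    """Percentage of identical residues over the ungapped columns."""
    return 100 * (1 - p_distance(row_a, row_b))


def poisson_distance(row_a, row_b):
    """Poisson-corrected distance: estimated substitutions per site."""
    p = p_distance(row_a, row_b)
    return math.inf if p >= 1 else -math.log(1 - p)
```

The correction matters most for distant pairs: the lower the identity, the more the observed differences underestimate the number of substitutions that actually took place.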

Since the PredictProtein server allows the prediction of the secondary structure of proteins based on a multiple sequence alignment of homologues in the database, and since there is not yet any GPD crystal structure available, he decides to try a structure prediction. He resubmits his sequence and asks for a structure prediction in MSF format. This result is very useful since it not only gives him a multiple alignment of the homologous sequences, but also a structure prediction for the query sequence alone, as well as for the complete alignment. In general these predictions are correct for some 80% of the residues. The result of this analysis may later help him in manually correcting the multiple alignment that was prepared with ClustalW.

So now Tomas feels that he should decide for himself which sequences to include in his analysis.

Using the Enzyme server at ExPASy at the University of Geneva in Switzerland, Tomas now downloads the following sequences (he actually had done this already at the start of his project):

SWISS-PROT:
Q00055, GPD1_YEAST; P41911, GPD2_YEAST; P34517, GPDA_CAEEL; 
P52425, GPDA_CUPLA; P13706, GPDA_DROME; P07735, GPDA_DROVI; 
P37606, GPDA_ECOLI; P43798, GPDA_HAEIN; P21695, GPDA_HUMAN; 
P13707, GPDA_MOUSE; P08507, GPDA_RABIT; P40716, GPDA_SALTY; 
P21696, GPDA_SCHPO;

 

The file he obtains from the server by FTP (File Transfer Protocol) contains a lot of text information that he does not need. Moreover, before he can use this file it has to be transformed into a file of the appropriate format. One of the problems of the use of freely available software is that each program and each database uses a different file format. ClustalW recognizes several file formats, but the server that Tomas is going to use accepts only Pearson/Fasta format. Thus he needs to reformat his sequences. Removal of text and reformatting of the sequences can easily be done manually, because this format is the simplest format there is. However, there exist sequence editors such as GDE, SeqApp or SeqPup and the sequence format conversion utility Readseq that do this automatically. So he decides to use the Readseq server to reformat the file into a Fasta file that can be used for his multiple alignment. He connects to the Readseq server at the NIH and selects the Pearson/Fasta output format.

(Please note that this utility also allows the presentation of publishable sequence alignments).
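To show how simple the Pearson/Fasta format is, here is a minimal Python sketch of the conversion. It assumes entries in the Swiss-Prot flat-file layout: an "ID" line carrying the entry name, an "SQ" line followed by the sequence in blocks separated by spaces, and "//" closing the entry. It is not Readseq, which handles many more formats, and the function name is invented for this example.

```python
def swissprot_to_fasta(text, width=60):
    """Convert Swiss-Prot flat-file entries to Pearson/Fasta text.

    Only the entry name (from the ID line) and the sequence are kept.
    All annotation lines are dropped.
    """
    records = []
    name, blocks, in_sequence = None, [], False
    for line in text.splitlines():
        if line.startswith("ID "):
            name = line.split()[1]
        elif line.startswith("SQ "):
            in_sequence = True
        elif line.startswith("//"):
            sequence = "".join(blocks)
            rows = [sequence[i:i + width] for i in range(0, len(sequence), width)]
            records.append("\n".join([">" + name] + rows))
            name, blocks, in_sequence = None, [], False
        elif in_sequence:
            blocks.append("".join(line.split()))  # drop the spacing
    return "".join(record + "\n" for record in records)
```

Each Fasta record is just a ">" line with a name, followed by the bare sequence. Additional sequences can therefore be appended to such a file with any text editor.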

To this file he now adds his own sequence and the GenBank-translated Pseudomonas GPD sequence. He uses this final file as input for the construction of a multiple alignment.

He contacts the ClustalW server at the Baylor College in Houston and pastes the Pearson/Fasta file into the sequence window. He submits the data to the server using all default settings, then sits back and waits for the result to be sent back to him. This takes a few minutes. Here is part of the output that he received: ClustalW output

With the help of the alignment you should be able to find out why the Pseudomonas sequence was never included in the SwissProt database. (Answer: when you compare the Pseudomonas DNA and protein sequences with other GPD sequences, you will see that the protein sequence is truncated due to a sequencing error that leads to a frameshift in the 3' end of the ORF). This illustrates the usefulness of SwissProt. Erroneous or incomplete sequences are not included.

Interestingly, the GPD2_YEAST sequence has a long N-terminal extension, which is absent from GPD1_YEAST and most other GPDs. This extension is recognized by the PSORT server as a mitochondrial transit sequence (try it out). It has been reported that the GPD2 gene product is cytosolic under conditions of glucose repression, but this is a condition where mitochondria are not active. It would be interesting to see what the localization of this enzyme would be under glucose depletion.


Encouraged by the nice alignment and the rapid progress of his project, Tomas decides (albeit a bit prematurely) to prepare a nice alignment suitable for the publication he wants to write on this project. Therefore he has to convert the ClustalW alignment (a Pearson/Fasta version was attached to the end of his ClustalW output) to MSF format. This is done again with the Readseq server at the NIH. (By the way, Readseq allows the creation of Pretty Print alignments ready for publication).

(Try PrettyPrint as output format to see which kind of presentation of the alignment you prefer).

The newly created MSF file is now pasted into the BOXSHADE utility at the WWW server of the Swiss Institute for Experimental Cancer Research.

Here is an example of the output (with default settings, 39k).


For the preparation of a phylogenetic tree Tomas decides to use the DARWIN server. From the last BLAST table he copies all the lines with GPD accession numbers into an MS Word document and cuts from this (using the combined Alt plus Shift keys and the mouse) the relevant column with the accession numbers. The list of accession numbers together with his Leishmania protein sequence and the Pseudomonas sequence are submitted to the AllAll utility of the DARWIN (Data Analysis and Retrieval With annotated Nucleic acid and protein sequences) WWW server at the Eidgenössische Technische Hochschule in Zurich. He selects the AllAll utility of the server and as input he gives the file he just created with a word processor. The resulting PAM-distance matrix, multiple alignment and rooted and unrooted phylogenetic trees are returned to him by email within a few minutes in the form of plain text and postscript files. He cuts out the postscript file containing the trees from his email message and prints them on his postscript printer, using the "LaserWriter Utility" (an alternative way to see the ps file is to open it from inside a postscript reader). This is how the DARWIN output should look:


The DARWIN output should appear here, but Tomas did not manage to get a reply within a reasonable delay! (Try it out yourself, maybe you will have more luck now).


So Tomas now decides to send his alignment file to the ClustalW server in Kyoto in Japan. This server does not only align the sequences, as was done above, but it also returns a neighbor-joining tree from which he can try to create his tree with a program like Treeview on the Mac (or Treetool on a Unix machine).

This is the output from the Kyoto server

This is the figure he drew himself with Treeview


From the topology of the trees it is immediately obvious to Tomas that the GPD of the protozoan Leishmania is more closely related to the bacterial sequences than to any of the eukaryotic GPDs. A careful inspection of the position of indels in the Leishmania sequence confirms this view. Apparently Tomas has discovered an event of horizontal gene transfer, and this discovery took him less than 2 hours after the completion of his DNA sequencing project. To find the explanation for this exciting observation and to write the discussion for his paper, there are no WWW servers available yet. This is a job that will certainly take much more than 2 hours.


The manuscript*) describing the above results has been published in Molecular and Biochemical Parasitology. A copy can be obtained from Fred Opperdoes.

*) Kohl, L., Drmota, T., DoThi, C.D., Callens, M., Van Beeumen, J., Opperdoes, F.R. and Michels, P.A.M. (1995) NAD-linked glycerol-3-phosphate dehydrogenase of Trypanosoma brucei and Leishmania mexicana. Cloning and characterization of the genes and expression of the trypanosomal protein in Escherichia coli. Molecular and Biochemical Parasitology, 76, 159-173.


Last updated: 23 September, 1997.

Created by: Fred Opperdoes