From 'The Cyberbiochemist', which appeared in The Biochemist of October/November 1996
Birkbeck College and Venus Internet Ltd, London
The 'central dogma of molecular biology' describes how genetic information is stored as a series of bases along a DNA polymer, and how base triplets are transcribed into messenger RNA and then translated to give the 20 'naturally occurring' amino acids. The sequence of the amino acids, in turn, determines the precise three-dimensional structure and the function of the protein product. It is probably not an exaggeration to say that the whole of molecular biology lies behind the question of how gene and protein sequences determine structure, and how structure determines function.
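As a rough illustration of how base triplets map onto amino acids, the following Python sketch translates a short coding sequence using the standard genetic code. It is a simplification: it reads the DNA coding strand directly rather than modelling the messenger RNA intermediate, and the function name `translate` and the example sequence are invented for illustration.

```python
from itertools import product

# Standard genetic code. Amino acids are listed in codon order with the
# bases cycling T, C, A, G (first base slowest); '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA coding strand into one-letter amino acid codes.

    Reading starts at the first base and stops at the first stop codon.
    Codons containing unrecognised characters are reported as 'X'.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    # Hypothetical 12-base sequence: ATG GCC AAA TAA
    print(translate("ATGGCCAAATAA"))
```

The 64 possible triplets encode only 20 amino acids plus stop signals, so most amino acids are specified by more than one codon.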
The amount of sequence information available has grown exponentially over the last few years as a result of the increasing sophistication of cloning and sequencing techniques, as well as the huge efforts being put into the human and other genome projects. Fortunately, so far at least, this information is available free of charge, and most of it can be viewed and downloaded over the Internet. Gene sequences are collected into the International Nucleotide Sequence Database Collaboration, which comprises GenBank in the USA, the EMBL database in Europe and DDBJ in Japan. The organisations exchange data on a daily basis and so are - to all intents and purposes - equivalent. Version 95 of GenBank, released in June 1996, contains 835 000 sequences, which together contain over 500 million bases! This illustrates another reason for the growth of the Internet as a distribution medium for sequence information. This single database will occupy over 2 Gb of disk space; not every institution can set aside adequate resources to store this amount of information.
The best-known protein sequence databases are probably PIR (the Protein Identification Resource) and SwissProt. SwissProt, which is largely derived from PIR, is the best-annotated database. A SwissProt entry contains comprehensive references and links to other databases, describing the protein's function, important sequence motifs and (where this is known) its structure.
On the Web, many of these links - including some of the references - are hyperlinks, giving access to a huge repository of information about the protein. OWL is an example of a composite protein database, derived from the primary source databases, including GenBank (with translations of gene sequences). Each unique protein is only included once.
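The idea of a composite database in which each unique protein appears only once can be sketched in a few lines of Python. This is only an illustration of the principle, not a description of how OWL is actually built: the function name `build_composite`, the rule that the first source listed takes priority, and the toy entries are all assumptions.

```python
def build_composite(sources):
    """Merge several sequence collections into a non-redundant set.

    `sources` is a list of (database_name, {entry_id: sequence}) pairs,
    given in priority order. When the same sequence occurs in more than
    one source, only the entry from the earliest source is kept.
    """
    composite = {}
    for database_name, entries in sources:
        for entry_id, sequence in entries.items():
            key = sequence.upper()
            if key not in composite:
                composite[key] = (database_name, entry_id)
    return composite


if __name__ == "__main__":
    # Toy data with invented identifiers and sequences.
    sources = [
        ("SwissProt", {"SP1": "MAKVLT", "SP2": "MGSSHH"}),
        ("PIR", {"PIR1": "MAKVLT", "PIR2": "MTTQAP"}),
    ]
    for sequence, origin in build_composite(sources).items():
        print(sequence, origin)
```

Here identity is judged on exact sequence matches only; a real composite database also has to decide how to treat near-identical entries and fragments.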
When faced with an 'unknown' gene or protein sequence, the question a molecular biologist will ask automatically is, "What other sequences are similar to this one?". The first step is a straightforward search of the databases for similar sequences. Two of the best-known programs for this are Blast and FastA. Both can be used to search either gene or protein databases; the related program TFastA will take a protein sequence and compare it with a gene database, translating each database entry in all six reading frames. Obviously, if there are any significant matches, they will show up as sequences with a high percentage of identical residues and conservative substitutions. However, further calculations and (often) experimental evidence are needed to determine whether two proteins are evolutionarily related, i.e. derived from a common ancestor. The word 'homology', which is often used as a synonym for 'similarity', should strictly be used to indicate such an evolutionary relationship between sequences.
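Two of the ideas above, translation in six reading frames and the percentage of identical residues, can be sketched in Python as follows. This is not how Blast, FastA or TFastA are implemented: it performs no database search and no alignment, and `percent_identity` assumes that the two sequences have already been aligned to equal length. All function names and example sequences are invented for illustration.

```python
from itertools import product

# Standard genetic code in T, C, A, G codon order; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate_frame(dna, offset):
    """Translate every complete codon starting at `offset` (0, 1 or 2)."""
    codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna):
    """Return three forward-strand and three reverse-strand translations."""
    forward = dna.upper()
    reverse = reverse_complement(forward)
    return [
        translate_frame(strand, offset)
        for strand in (forward, reverse)
        for offset in range(3)
    ]


def percent_identity(aligned_a, aligned_b):
    """Percentage of alignment columns holding the same residue.

    Both inputs must be pre-aligned and of equal length; '-' is a gap
    and never counts as a match.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        return 0.0
    matches = sum(
        a == b and a != "-" for a, b in zip(aligned_a, aligned_b)
    )
    return 100.0 * matches / len(aligned_a)


if __name__ == "__main__":
    for frame in six_frame_translations("ATGGCCAAATAA"):
        print(frame)
    print(percent_identity("MAKV", "MAKL"))
```

A high percentage identity on its own says nothing about conservative substitutions or statistical significance, which is why real search programs score matches with substitution matrices instead.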
Members of protein families that share a common fold, and a common function and mechanism, may have very little similarity over the whole length of their sequences. As an example, there is only 19% sequence identity between some members of the globin family. In most of these cases it is possible to identify family members through common sequence motifs. The best-known databases of protein 'motifs' or 'fingerprints' are PROSITE, PRINTS and (especially in the USA) BLOCKS. These databases contain patterns of residues that can be used to define protein families, and are very useful for identifying family relationships in sequences without known homologues. They can also be accessed and searched over the Internet.
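To show how a residue pattern can pick out family members, the sketch below converts a PROSITE-style pattern into a Python regular expression and scans a sequence with it. It handles only the common parts of the pattern syntax (single residues, `x` for any residue, `[...]` for allowed residues, `{...}` for forbidden residues, repeat counts and the `<` and `>` anchors), and the function names and the test sequence are invented for illustration. The example pattern N-{P}-[ST]-{P} is the PROSITE N-glycosylation site pattern.

```python
import re

# One pattern element: optional '<' anchor, a residue specification,
# an optional repeat count such as (4) or (2,4), and an optional '>' anchor.
ELEMENT = re.compile(
    r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)"
)


def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern into a regular expression string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, core, low, high, end = match.groups()
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # forbidden residues
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if start else "") + core + repeat + ("$" if end else ""))
    return "".join(parts)


def find_motif(pattern, sequence):
    """Return (position, matched text) for every hit, overlaps included."""
    # A lookahead lets the scan report matches that overlap one another.
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(sequence.upper())]


if __name__ == "__main__":
    print(prosite_to_regex("N-{P}-[ST]-{P}"))
    print(find_motif("N-{P}-[ST]-{P}", "MKNGSAANPTL"))
```

Short patterns like this one match many unrelated sequences by chance, so a hit suggests a family relationship rather than proving one.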
Commercial packages of sequence-analysis programs, such as the GCG suite, are readily available, but at a price! Fortunately, these are needed less and less as programs become available on the World Wide Web. In the UK, further sequence-analysis programs are available at Seqnet, the UK node of the European Molecular Biology Network (EMBnet). Any scientist working in a university or other 'non-profit' institution is entitled to a free account on Seqnet. Similar facilities are available at other EMBnet nodes (note by F.R. Opperdoes: access to the Belgian EMBnet Node (BEN) is not free but costs 6000 BF per account for a lab and 600 BF for any additional lab account).
European Sites
Bioinformatics Training Resources
The World Wide Web Virtual Library - Biomolecules: http://golgi.harvard.edu/sequences.html