President, Biomedical Computing, Inc.
Houston, Texas 77005
The fundamental building blocks of life are proteins. Enzymes, which are the molecular machines responsible for virtually all of the chemical transformations that cells are capable of, are proteins. In addition, much of the structure of a cell is made up of proteins. That part of the structure which is not made up of proteins is produced by enzymes which are proteins. A human contains on the order of 100,000 different proteins. It is the properties of and the interactions between these 100,000 proteins that make us what we are.
Proteins are variable-length, linear, mixed polymers of 20 different amino acids. Other terms used more or less interchangeably for amino acid polymers are peptides and polypeptides. These topologically linear polymers fold upon themselves to generate a shape characteristic of each different protein, and this shape, along with the different chemical properties of the 20 amino acids, determines the function of the protein. One of the most important concepts in modern biology is that the functional properties of proteins are determined largely by the sequence of the 20 amino acids in the linear polypeptide chain; that is, in many cases proteins are largely self-folding. Thus, in theory, knowing the sequence of a protein (the order in which the amino acids occur), one could infer its function.
What determines the order of amino acids in a protein? The Central Dogma of Molecular Biology describes how the genetic information we inherit from our parents is stored in DNA, and how that information is used to make identical copies of that DNA and is also transferred from DNA to RNA to protein. DNA is a linear polymer of 4 nucleotides: deoxyAdenosine monophosphate (abbreviated A), deoxyThymidine monophosphate (abbreviated T), deoxyGuanosine monophosphate (abbreviated G) and deoxyCytidine monophosphate (abbreviated C). RNA is a very similar polymer of Adenosine monophosphate, Guanosine monophosphate, Cytidine monophosphate, and Uridine monophosphate. Uridine monophosphate, abbreviated U, is a nucleotide functionally equivalent to Thymidine monophosphate.
A property of both DNA and RNA is that the linear polymers can pair with one another, and this pairing is sequence specific. In such double polymers (referred to as a "double helix" due to the shape they assume) G pairs with C and A pairs with T or U. All possible combinations of DNA and RNA double helices occur. One strand of DNA can serve as a template for the construction of a complementary strand, and this complementary strand can be used to recreate the original strand. This is the basis of DNA replication and thus all of genetics. Similar templating results in an RNA copy of a DNA sequence. Conversion of that RNA sequence into a protein sequence is more complex. This occurs by translation of a code consisting of three nucleotides into one amino acid, a process accomplished by cellular machinery including tRNA and ribosomes.
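The pairing rules above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the convention of writing the complementary strand in reverse order (because the two strands of a double helix run in opposite directions) is an assumption added here, not something stated in the text.

```python
# Pairing rules from the text: G pairs with C, and A pairs with T (U in RNA).
DNA_PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement_strand(dna: str) -> str:
    """Return the complementary DNA strand, read in the opposite direction.

    Assumption: the result is written as the reverse complement, which is
    the usual way of writing the second strand of a double helix.
    """
    return "".join(DNA_PAIRS[base] for base in reversed(dna.upper()))


def transcribe(dna: str) -> str:
    """Return the RNA with the same sequence as the given DNA strand (T -> U)."""
    return dna.upper().replace("T", "U")


original = "ATGGCCTAA"
copy = complement_strand(original)

print(copy)                                 # the templated complementary strand
print(complement_strand(copy) == original)  # the copy recreates the original
print(transcribe(original))                 # RNA version of the same sequence
```

Applying `complement_strand` twice returns the starting sequence, which is the templating property that replication relies on.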
Four different nucleotides taken three at a time can result in 64 different possible triplet codes, more than enough to encode 20 amino acids. These 64 codes are mapped onto the 20 amino acids as follows: first, one amino acid may be encoded by 1 to 6 different triplet codes; second, 3 of the 64 codes, called stop codons, specify "end of peptide sequence". Where multiple codons specify the same amino acid, the different codons are used with unequal frequency, and this distribution of frequency is referred to as "codon usage". Codon usage varies between species.
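The arithmetic above (64 triplets, 20 amino acids, 3 stop codons, 1 to 6 codons per amino acid) can be checked with a short Python sketch. It assumes the standard genetic code, written in the common compact form with bases ordered T, C, A, G, one-letter amino acid symbols, and `*` for a stop codon; the variable names are illustrative only.

```python
from collections import Counter
from itertools import product

# Assumption: the standard genetic code, with DNA bases in T, C, A, G order.
# Each letter is a one-letter amino acid symbol; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

# Enumerate all triplets (TTT, TTC, TTA, ...) and pair each with its meaning.
CODON_TABLE = {
    "".join(triplet): residue
    for triplet, residue in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

codons_per_residue = Counter(CODON_TABLE.values())
stop_codons = sorted(c for c, residue in CODON_TABLE.items() if residue == "*")
amino_acid_counts = {r: n for r, n in codons_per_residue.items() if r != "*"}

print(len(CODON_TABLE), "triplet codes")
print(len(amino_acid_counts), "amino acids")
print("stop codons:", stop_codons)
print("codons per amino acid range from",
      min(amino_acid_counts.values()), "to", max(amino_acid_counts.values()))
```

Counting how often each synonymous codon actually appears in the genes of one species, rather than in the table, would give that species' codon usage.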
The fact that DNA nucleotides need to be read three at a time to specify a protein sequence implies that a DNA sequence has three different reading frames determined by whether you start at nucleotide one, two, or three. (Nucleotide four will be in the same frame as nucleotide one and so on). Both strands of DNA can be copied into RNA (for translation into protein). Thus, a DNA sequence with its (inferred) complementary strand can specify six different reading frames.
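A minimal sketch of six-frame translation follows, assuming the standard genetic code and an input containing only the letters A, C, G and T. The frame labels (`+1` to `+3` for the given strand, `-1` to `-3` for the inferred complementary strand) and the function names are conventions chosen for this example.

```python
from itertools import product

# Assumption: standard genetic code, bases ordered T, C, A, G; "*" is a stop.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(triplet): residue
    for triplet, residue in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the inferred complementary strand, read in its own direction."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete triplets from the start of dna; leftovers are ignored."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frames(dna: str) -> dict:
    """Return the peptide read from each of the six reading frames."""
    dna = dna.upper()
    frames = {}
    for label, strand in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):  # start at nucleotide one, two, or three
            frames[f"{label}{offset + 1}"] = translate(strand[offset:])
    return frames


for name, peptide in six_frames("ATGGCCTAA").items():
    print(name, peptide)
```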
It is possible to chemically determine the sequence of amino acids in a protein and of nucleotides in RNA or DNA. However, it is vastly easier at present to determine the sequence of DNA than that of RNA or protein. Since the sequence of a protein can be determined from the DNA sequence which encodes it, most protein sequences are in fact inferred from DNA sequences. Conversion of RNA to a DNA copy (cDNA) is a simple laboratory procedure, so RNA molecules are themselves sequenced as cDNA copies.
Sequence analysis is the process of making biological inferences from the known sequence of monomers in protein, DNA and RNA polymers.
Although it is possibly true in theory that given a protein sequence one can infer its properties, the current state of the art in biology falls far short of being able to implement this in practice. Current sequence analysis is a painful compromise between what is desired and what is possible. Some of the many factors which make sequence analysis difficult are discussed in this section.
As noted above, the difficulty of sequencing proteins means that most protein sequences are determined from the DNA sequences encoding them. Unfortunately, the cellular pathway from DNA to RNA to protein includes some features that complicate inference of a protein sequence from a DNA sequence.
By and large, global, complete solutions are not available for determining an encoded protein sequence from a DNA sequence. However, by combining a variety of computational approaches with some laboratory biology, people have been fairly successful at accomplishing this in many specific cases. Nonetheless, this problem is currently considered one of the most important in computational biology.
Once you have obtained a protein sequence, inferring its structure and function is a vastly greater problem. As noted above, the structure of a protein is produced by the folding of a peptide chain back on itself and, in some cases, by the association of multiple peptide chains. This folding can occur by rotation around both the bonds within the constituent amino acids and the bonds that join the amino acids to one another. Unfortunately (or fortunately, as life depends on this fact), the number of possible folding patterns is effectively infinite.
To help cope with this daunting problem, biologists have divided the structural features of proteins into levels. The first level of structure, termed primary structure, refers just to the sequence of amino acids in the protein; this is what we know. Decades ago, it was found that polypeptide chains can sometimes fold into regular structures; that is, structures which are the same in shape for different polypeptides. These regular structures make up the second level, termed secondary structure. One such shape is helical, and is referred to as an alpha helix. In another such shape, the polypeptide chain folds back and forth, producing a sheet-like surface. This structure is referred to as a beta sheet. There are additional examples of secondary structural types into which a polypeptide might fold, and some peptides do not fold into one of these regular structures at all. In fact, most long polypeptide chains (e.g. virtually all real biological proteins) fold into different secondary structures along different portions of their length.
The secondary structures described above are all very simple and regular: the round and round of an alpha helix or the back and forth of a beta sheet. Other, more complex structures are also found over and over in different proteins. One example is the helix-loop-helix motif found in many transcription factors. These features are referred to as super-secondary structure. When you look at an actual polypeptide chain, the final shape is made up of secondary features, perhaps super-secondary structural features, and some apparently random conformations. This overall structure is referred to as the tertiary structure. Finally, many biological proteins are constructed of multiple polypeptide chains. The way these chains fit together is referred to as the quaternary structure of the protein.
This complex nomenclature for protein structure has developed because the problem of understanding protein structure is so important and so difficult. The importance of understanding protein structure comes from two factors working together. The first of these is that the function of the protein is absolutely dependent on its structure. In fact, one of the most common ways for proteins to lose their function is to have their structure disrupted, for example by heat or mechanical stress (e.g. beating an egg white); only completely and properly folded proteins "work". The second factor is that it is extremely difficult to determine the structure of a protein experimentally. To date, the primary structure of many sequences has been determined (about 30,000, available from SwissProt). In contrast, the tertiary structure of many fewer (about 500, available from the Brookhaven Protein Data Bank) has been determined. Obviously, then, it would be of great value if tertiary structure could be determined from primary structure. It is not an exaggeration to state that the ability to exactly predict protein structures and, from that, protein function would revolutionize medicine, pharmacology, chemistry and ecology.
Current research on tertiary structure prediction has used two basic approaches: homology-based and ab initio. Homology-based approaches attempt to determine the tertiary structure of a protein by comparing its primary sequence to that of a related protein whose structure is known. This is a laborious approach, but one that can achieve a high success rate. Even when the structure of a homologous protein is not yet available, the alignment of homologous proteins may provide information about their secondary structure, and such predictions are between 70 and 80% reliable. A secondary structure prediction server of this kind, called PredictProtein, is available at the EMBL. Unfortunately, these methods require the existence of similar protein(s) with known structure(s), something not always available. Ab initio approaches try to determine the structure which minimizes free energy. This is done using either Monte Carlo methods or neural net software.
Finally, even if/when you determine the tertiary structure of a protein, techniques have not yet been developed for inferring the functional properties of this protein from its structure.
The computer programs which are used to infer protein sequence from DNA sequence provide information which can be used to help approach a solution. For example, if you are trying to find out where in a DNA sequence a protein is encoded, it is very useful to know what peptides would be encoded by all six reading frames. A stretch containing many stop codons is a poor candidate for encoding a protein. This will not absolutely tell you where the protein sequence starts and stops, but it will help you guess where that might occur. Programs exist for doing this. In fact, there are many factors you can use to guess where in a DNA sequence a protein sequence might reside; use of the expected codon bias, presence of characteristic sequences representing regulatory signals in the DNA, and so forth. One family of programs integrates a variety of these approaches, and, using either explicit algorithms or trained neural nets, makes a prediction.
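As a rough illustration of the stop-codon idea, the sketch below finds, in each of the six reading frames, the longest stretch of codons that contains no stop codon, and ranks the frames by that length. It assumes the three standard stop codons (TAA, TAG, TGA) and an input of plain A, C, G, T letters. The function names are hypothetical, and this is far simpler than the integrated gene-finding programs mentioned above: it ignores codon bias, regulatory signals and start codons.

```python
# Assumption: the three stop codons of the standard genetic code.
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def longest_open_stretch(strand: str, offset: int) -> tuple:
    """Return (start, end) of the longest stop-free run of codons in one frame.

    Indices are nucleotide positions in `strand`; `end` is exclusive.
    """
    best = (offset, offset)
    run_start = offset
    for i in range(offset, len(strand) - 2, 3):
        if strand[i:i + 3] in STOP_CODONS:
            if i - run_start > best[1] - best[0]:
                best = (run_start, i)
            run_start = i + 3  # the next run begins after the stop codon
    # Close the final run at the last complete codon of this frame.
    end = offset + 3 * ((len(strand) - offset) // 3)
    if end - run_start > best[1] - best[0]:
        best = (run_start, end)
    return best


def rank_frames(dna: str) -> list:
    """Return (length, frame, start, end) for all six frames, longest first.

    Coordinates for the "-" frames refer to the reverse-complement strand.
    """
    dna = dna.upper()
    minus = dna.translate(COMPLEMENT)[::-1]
    results = []
    for label, strand in (("+", dna), ("-", minus)):
        for offset in range(3):
            start, end = longest_open_stretch(strand, offset)
            results.append((end - start, f"{label}{offset + 1}", start, end))
    return sorted(results, reverse=True)


for length, frame, start, end in rank_frames("ATGGCCGCATTGTAACC"):
    print(frame, "longest open stretch:", length, "nucleotides", (start, end))
```

A long open stretch is only a hint; as the text says, it helps you guess where a protein might be encoded without settling where it starts and stops.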
If you have just determined a sequence of an interesting bit of DNA, one of the first questions you are likely to ask yourself is "has anybody else seen anything like this?" Fortunately, there has been a very successful international effort to collect all the sequences people have determined in one place so they can be searched. For DNA sequences, three groups have cooperated in this effort, one in Japan, one in Europe, and one in the United States to produce DDBJ, EMBL and GenBank, respectively. These databases are frequently reconciled with each other, so that searching any one is virtually the same as searching all three. The problem is that these databases are HUGE and, as a result, you must compare your sequence with this vast number of other sequences efficiently. A number of programs have been written to rapidly search a database for a query sequence, two of which, BLAST and FASTA, will be discussed in this course. The techniques used by these programs to make searching rapid result in some loss of rigor of comparison. It is possible (although, as it turns out, unlikely) that a weak but relevant similarity could be missed by these programs. In addition, many times these programs will flag a sequence as being similar to your query sequence when this similarity is not significant. Thus, these programs should be seen as tools for identifying a small subset of sequences from the database for retrieval and further analysis rather than ends in themselves.
Databases of protein sequences, including SwissProt and PIR, also exist and can similarly be searched.
Which program should you use to search a database, FASTA or BLAST? This question is about as controversial as that over choices of computers (Mac vs. PC) or religions. In fact, as you enter the world of sequence analysis, you will find religious wars between proponents of different programs over and over. Worse, new programs are constantly appearing. In addition, even after having selected a program, you will frequently have to select values for "parameters" and always have to interpret the output. There are no magic answers to help you do these things. What you will acquire in this course is the background you need to make reasonable decisions on these issues.
First, although it is not possible to completely predict the function or shape (structure) of a protein from its sequence de novo, some useful inferences about structure and function can be drawn, especially by comparing the sequence of a protein of unknown structure and function to sequences of proteins with known structure and function. Second, if the goal of structure/function prediction is to be reached in the future, it will be because of partial analyses done in the present. Third, by comparing the sequence of equivalent proteins from different species of animals (such equivalent proteins are called "homologues"), one can draw inferences about the evolution of these species from their common ancestors.
One of the most useful things people do with sequences is to compare them to other sequences. However, such comparisons are not as easy to make as one might first think. One factor that complicates analysis is that the sequences biologists need to compare are usually not identical, but only similar. In addition to a small number of substitutions (e.g. a Guanine for an Adenine at one position in a DNA sequence), there will be insertions and deletions in one sequence relative to the other. Also, how you do the comparison will depend on what you are comparing and what you want to learn from it. For these reasons, many different kinds of programs have been written to compare sequences.
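To make the role of substitutions, insertions and deletions concrete, here is a minimal sketch of one classical comparison technique, global alignment by dynamic programming (the Needleman-Wunsch method). It is offered as an illustration, not as the method used by any particular program discussed in this course. The scoring values (+1 for a match, -1 for a mismatch, -1 per gap position) are arbitrary assumptions; real programs use more elaborate scoring.

```python
def global_align(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b.

    A "-" in an aligned string marks an insertion or deletion.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap  # a prefix of `a` aligned against nothing
    for j in range(1, cols):
        score[0][j] = j * gap  # a prefix of `b` aligned against nothing

    def pair_score(i: int, j: int) -> int:
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the table: each cell holds the best score for the two prefixes.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(i, j),  # match or substitution
                score[i - 1][j] + gap,                   # gap in b
                score[i][j - 1] + gap,                   # gap in a
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


best, line_a, line_b = global_align("GATTACA", "GATACA")
print(best)
print(line_a)
print(line_b)
```

Changing the scoring values changes which alignment is considered best, which is one reason the choice of program parameters matters when interpreting results.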
Where can you find these programs, and what do you need to have and to do to run them? There is no one place where all possible sequence analysis programs reside and there is no one way to run them. You might buy a commercial sequence analysis package such as DNA* or MacVector to run on the PC or Macintosh sitting on your desktop. You might go out on the network and download the source code for a program which you compile and run on your computer. Your institution may have a computer center where various programs, both commercial and free, have been installed. You might write your own program based either on an algorithm you read about in a journal, or even on an algorithm you derive yourself. (This happens rarely, and I don't recommend it if you are interested in biology rather than bio-computing.) Or finally, there are now places on the network where programs run that you can connect to and use. This last possibility is dealt with in the practical course. Because none of these approaches is perfect, you will probably decide to use some combination of them.