Weight Matrices for Sequence Similarity Scoring
Version 2.0
May 1996
David Wheeler, Ph.D.
Department of Cell Biology,
Baylor College of Medicine
Houston, Texas
E-mail: wheeler@bcm.tmc.edu
Table of Contents
Weight Matrices for Sequence Similarity Scoring
Outline:
- Objective: Overview of methods and theories that underlie the
construction of scoring matrices.
- Examples of weight matrices for nucleotide and amino acid
scoring.
- Transition probability matrix: PAM
- Construction
- Properties
- Sources of error
- BLOSUM matrix
- Construction
- Sources of error
- Practical aspects
- Other refinements to transition probability matrices.
Reading:
- D.G. George, W. C. Barker and L. T. Hunt. (1990). Mutation
Data Matrix
and Its Uses. in Methods in Enzymology vol 183; R.F. Doolittle,
ed. pp.
333-351. Academic Press, Inc. New York.
- M.O. Dayhoff (1978) Atlas of Protein Sequence and Structure
(Natl.
Biomed. Res. Found., Washington), Vol. 5, Suppl. 3, pp.
345-352.
- S.F. Altschul (1991). Amino acid substitution matrices from
an
information theoretic perspective. J. Mol. Biol. 219
555-565.
- S.F. Altschul, M.S. Boguski, W. Gish and J.C. Wootton.
(1994). Issues
in searching molecular sequence databases. Nature Genetics 6:
119-129.
Back to Table of Contents.
Importance of scoring matrices
- Scoring matrices appear in all analysis involving sequence
comparison.
- The choice of matrix can strongly influence the outcome of the
analysis.
- Scoring matrices implicitly represent a particular theory of
evolution.
- Understanding theories underlying a given scoring matrix can
aid in making proper choice.
Similarity vs. Distance
- 1. Elements of the matrices specify the weight to assign a
given comparison by:
- the cost of replacing one residue with another (distance);
or
- a measure of the similarity for the replacement.
- 2. Distance is more naturally used for phylogenetic tree
reconstruction; similarity is used for database searching.
- 3. The logic of the algorithm doesn't change: maximizing a
similarity is fundamentally the same as minimizing a
distance.
- 4. Distance and similarity matrices are inter-convertible by
some mathematical transformation appropriate for the given
application.
Back to Table of Contents.
Examples of matrices
A remark on notation
- When we consider scoring matrices, we encounter the
convention that matrices have numeric indices corresponding to
the rows and columns of the matrix. That is, [M11]
refers to the entry at the first row and the first column. In
general, [Mij] refers to the entry at the ith row and
the jth column. To use this for sequence alignment, we simply
associate a numeric value to each letter in the alphabet of the
sequence. For example, if the alphabet is
[squiggly_A]={A,C,G,T}]
then A = 1, C = 2, etc. Thus, one would find the score for a
match between A and C at [M12]. Since we consider
different scoring matrices in this section, we distinguish
between them by using different letters for the matrix,
[Rij] refers to the Replacement matrix, [Sij]
to the log odds
matric, and so on.
Nucleotide scoring
Identity matrix (similarity)
A T C G
A 1 0 0 0
T 0 1 0 0
C 0 0 1 0
G 0 0 0 1
For elements in row i by column j:
[Sij=1],[i=j]
[Sij=0],[i!=j]
BLAST matrix (similarity)
A T C G
A 5 -4 -4 -4
T -4 5 -4 -4
C -4 -4 5 -4
G -4 -4 -4 5
Transition/Transversion Matrix
A T C G
A 0 5 5 1
T 5 0 1 5
C 5 1 0 5
G 1 5 5 0
Nucleotide bases fall into two categories depending on the ring
structure of the base. Purines (Adenine and Guanine) are two ring
bases, pyrimidines (Cytosine and Thymine) are single ring bases.
Mutations in DNA are changes in which one base is replaced by
another. A mutation that conserves the ring number is called a
transition (e.g., A -> G or C -> T) a mutation that changes the
ring number are called transversions. (e.g. A -> C or A -> T
and so on).
Although there are more ways to create a transversion, the number
of transitions observed to occur in nature (i.e., when comparing
related DNA sequences) is much greater. Since the likelihood of
transitions is greater, it is sometimes desireable to create a weight
matrix which takes this propensity into account when comparing two
DNA sequences.
Use of a Transition/Transversion Matrix reduces noise in
comparisons of distantly related sequences.
Protein scoring
- Identity matrix
[Rij=1],[i=j]
[Rij=0],[i!=j]
- Genetic Code Matrix
- Score based on minimum number of base changes required to
convert one amino acid into another.
- [DNA Image] Distance matrix
- Physical/chemical characteristics
- Attempt to quantify some physical or chemical attribute of
the residues and arbitrarily assign weights based on
similarities of the residues in this chosen property.
- Hydrophobicity matrix
Back to Table of Contents.
[Sij= log(qij/(pi*pj))]
S is the log odds ratio of two probabilities: the probability that
two residues, i and j, are aligned by evolutionary descent and the
probability that they are aligned by chance.
- [qij] are the frequencies that residue i and j are
observed to align in sequences known to be related. They are
derived from a "transition probability matrix."
- [pi] and [pj] are frequencies of occurrence of
residue i and j in the set of sequences.
- e.g., PAM250, BLOSUM62 et al.
Back to Table of Contents.
Summary of steps:
- 1. Align sequences that are at least 85% identical.
- o minimize ambiguity in alignments.
- o minimize the number of coincident mutations.
- 2. Reconstruct phylogenetic trees and infer ancestral
sequences. 71 trees containing 1,572 exchanges were used.
- 3. Tally replacements "accepted" by natural selection, in all
pair-wise comparisons (each [Aij] is the number of times
amino acid j was replaced by amino acid i in all comparisons).
- 4. Compute amino acid mutability, [mj], i.e., the
propensity of a given amino acid, j, to be replaced.
- 5. Combine data from 3 & 4 to produce a Mutation
Probability Matrix for one PAM of evolutionary distance, according
to the following formulae:
[Mij=(mj*Aij)/(sum_over_all_i Aij)] [Mjj=1-mj]
- 6. Calculate Log Odds Matrix for similarity scoring: Divide
each element of the Mutation Data Matrix, M, by the frequency of
occurance of each residue: [Rij=Mij/fi] R is a Relatedness
Odds Matrix, [fi] is the frequency of residue i. The Log
Odds Matrix, [Sij], is calculated from the relatedness
odds matrix, [Rij], simply by taking the log of each
[Rij].
- Different protein families manifest different PAM rates.
Back to Table of Contents.
- The sum of [mj] for any column, j, is one (trivial).
Note that the probability that an amino acid will change is on the
order of 1% for each amino acid. The probability that it will stay
the same is on the order ot 99% for each amino acid.
- The Mutation Probability Matrix, M1, defines a unit of
evolutionary change: specifically, 1 PAM (Accepted Point Mutation
per 100 residues).
- the matrix can be used to simulate evolution by using a
random number generator to select fate of each residue in the
sequence according to the probability given in the table.
- exposing a 100 residue protein sequence of average
composition to the evolutionary change represented by M1
results in one amino acid change, on average.
- Successive application of M1 on a sequence yields 2, 3, 4...
PAMs of evolutionary change.
- The following operations are equivalent:
- successive application of M1 on a sequence.
- matrix multiplication of M1 by itself, M1*M1, followed by
operation on a sequence.
- scaling the elements of M1 by a constant of
proportionality, [lambda] = 1,2,3... according to the
formulae below, followed by operation on a sequence:
[Mij=(lambda*mj*Aij)/sum_over_all_i Aij)]
[Mjj=1-lambda*mj]
the above equation enables the direct calculation of a matrix
for any desired PAM distance.
- The matrix has compositional information in it, since it
depends on the relative frequencies of amino acids in the pool of
sequences from which the tallies were drawn. In the extremes, the
following obtain:
- The elements of the 0 PAM matrix are 1 for [Mii]
and 0 for [Mij];
- The [infinity-] PAM matrix elements approaches the
asymptotic amino acid composition.
Back to Table of Contents
Assumptions in PAM model:
- Replacement at any site depends only on the amino acid at that
site and the probability given by the table (Markov model).
- Sequences that are being compared have average amino acid
composition.
Sources of error in PAM model
- Many sequences depart from average composition.
- Rare replacements were observed too infrequently to resolve
relativeprobabilities accurately (for 36 pairs no replacements
were observed!).
- Errors in 1PAM are magnified in the extrapolation to 250
PAM.
- The Markov process is an imperfect representation of
evolution: Distantly related sequences usually have islands
(blocks) of conserved residues. This implies that replacement is
not equally probable over entire sequence.
Back to Table of Contents.
BLOSUM (Blocks Substitution Matrix) Matrix
Steven Henikoff and Jorja G. Henikoff (1992). Amino acid
substitution
matrices from protein blocks. Proc. Natl. Acad. Sci. 89:
10915-10919.
- Starting data is conserved blocks from Blocks database.
- aligned, ungapped sequences
- widely varying similarity, but measures are taken to avoid
biasing the sample with frequently occurring highly related
sequences.
- Tallies of replacements are made by straight forward tallying
of all pairs of aligned residues, [fij]
- The observed frequency of each pair is: [qij] =
[fij]/(total number of residue pairs)
- This includes cases of i=j (i.e. no replacement
observed).
- The expected frequency of each pair is essentially the
product of the frequencies of each residue in the data
set.
- Similar sequences in a block, above a threshold percent
similarity are clustered and members of the cluster count
fractionally toward the final tally.
- Reduces the number of identical pairs (AA, SS, TT, etc.,
matches) in the final tallies.
- Somewhat analogous to increasing the PAM distance.
- If clustering threshold is 80%, final matrix is BLOSUM
80.
- Clustering at 62% reduces the number of blocks contributing
to the table by 25%-still 1.25 x 10^6 pairs contributed!
- Least frequent amino acid pair replacement was observed
2369 times!
Back to Table of Contents.
New Scoring Matrices
- David T. Jones, William R. Taylor and Janet M. Thornton(1992).
The rapid generation of mutation data matrices from protein
sequences. CABIOS 8: 275-282.
- An update to the PAM matrix using the method of Dayhoff.
59,190 accepted mutations in 16,130 sequences were tallied.
- Gaston H. Gonnet, Mark A. Cohen, Steven A. Benner (1992).
Exhaustive Matching of the Entire Protein Sequence Database.
Science 256: 1443-1445
- The answer to life the universe and everything. A scoring
matrix based on alignment of the entire SWISS-PROT data base. 1.7
x 10^6 matches were used from sequences differing by 6.4 to 100.0
PAM.
Back to Table of Contents.
Selecting an optimal PAM matrix
- As the evolutionary distance increases, the information
content of the corresponding PAM matrix decreases.
- As the information content of the PAM matrix decreases a
longer region of similarity is required to generate a sufficiently
high score to be distinguishable from chance.
- Regions of similarity in real proteins are found in narrower
"blocks" as the evolutionary distance increases.
|
PAM
distance
|
H (bits)
|
Min. signifant length
(30 bits)
|
|
10
|
3-43
|
9
|
|
20
|
2-95
|
11
|
|
30
|
2-57
|
12
|
|
40
|
2-26
|
14
|
|
50
|
2-00
|
15
|
|
60
|
1-79
|
17
|
|
70
|
1-60
|
19
|
|
80
|
1-44
|
21
|
|
90
|
1-30
|
24
|
|
100
|
1-18
|
26
|
|
110
|
1-08
|
28
|
|
120
|
0-08
|
31
|
|
130
|
0-90
|
34
|
|
140
|
0-82
|
37
|
|
150
|
0-70
|
40
|
|
160
|
0-70
|
43
|
|
170
|
0-65
|
47
|
|
180
|
0-60
|
51
|
|
190
|
0-55
|
55
|
|
200
|
0-51
|
59
|
|
210
|
0-48
|
63
|
|
220
|
0-45
|
68
|
|
230
|
0-42
|
73
|
|
240
|
0-39
|
78
|
|
250
|
0-36
|
83
|
|
260
|
0-34
|
89
|
|
270
|
0-32
|
91
|
|
280
|
0-30
|
100
|
|
290
|
0-28
|
107
|
|
300
|
0-27
|
113
|
|
310
|
0-25
|
120
|
|
320
|
0-24
|
127
|
|
330
|
0-22
|
134
|
|
340
|
0-21
|
141
|
Back to Table of Contents.
Other specialized scoring matrices
- Lee F. Kowlakowski and Kenneth A. Rice (1994)
Accepted-Mutation Parsimony Functionally Classifies
G-Protein-Coupled Receptors. Science (in press).
- Creation of a step-matrix based solely on aligned blocks of
G-Protein-Coupled Receptors in which the elements of the matrix
are proportional to the rarity of the substitution.
- Used to excellent advantage in constructing a coherent
phylogeny of widely diverged G-protein receptors.
- Jean-Michael Claverie (1993). Detecting frame shifts by amino
acid sequence comparison. J. Molecular Biology 234: 1140-1157.
- Matrices for detecting frame shift mutations giving rise to
new coding sequences (or arising from sequencing error).
- Domenico Bordo and Patrick Argos (1991). Suggestions for
"safe" residue
- substitutions in site-directed mutagenesis.
- Scoring matrix created from observed substitutions of
residues found in similar structural environments in 3D.
Back to Table of Contents.
Last updated: 8 August 1997.
created by :Fred
Opperdoes