Weight Matrices for Sequence Similarity Scoring

Version 2.0
May 1996

 

David Wheeler, Ph.D.

Department of Cell Biology,
Baylor College of Medicine
Houston, Texas
E-mail: wheeler@bcm.tmc.edu

 


Outline:

 

Reading:

 

  • D.G. George, W.C. Barker and L.T. Hunt (1990). Mutation data matrix
    and its uses. In Methods in Enzymology, vol. 183; R.F. Doolittle, ed.,
    pp. 333-351. Academic Press, Inc., New York.
  • M.O. Dayhoff (1978). Atlas of Protein Sequence and Structure, Vol. 5,
    Suppl. 3, pp. 345-352. Natl. Biomed. Res. Found., Washington.
  • S.F. Altschul (1991). Amino acid substitution matrices from an
    information theoretic perspective. J. Mol. Biol. 219: 555-565.
  • S.F. Altschul, M.S. Boguski, W. Gish and J.C. Wootton (1994). Issues
    in searching molecular sequence databases. Nature Genetics 6: 119-129.



Importance of scoring matrices

 

Similarity vs. Distance



Examples of matrices

A remark on notation

 

  • Scoring matrices follow the usual convention that entries are indexed by row and column number. That is, $M_{11}$ refers to the entry in the first row and the first column and, in general, $M_{ij}$ refers to the entry in the ith row and the jth column. To use this for sequence alignment, we simply assign a numeric index to each letter in the alphabet of the sequence. For example, if the alphabet is

    $\mathcal{A} = \{A, C, G, T\}$

    then A = 1, C = 2, and so on. Thus, the score for aligning A with C is found at $M_{12}$. Since we consider several scoring matrices in this section, we distinguish between them by using a different letter for each: $R_{ij}$ refers to the replacement matrix, $S_{ij}$ to the log odds matrix, and so on.
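
The indexing convention above can be sketched in a few lines of Python. This is only an illustration: the names ALPHABET, INDEX and entry are hypothetical, and the matrix M holds made-up values (10 * row + column) chosen so that each look-up is easy to check by eye. Python lists are 0-based, so the helper subtracts one from the 1-based indices used in the text.

```python
# Alphabet in the order used by the example in the text: A = 1, C = 2, G = 3, T = 4.
ALPHABET = ["A", "C", "G", "T"]

# Map each letter to its 1-based matrix index.
INDEX = {letter: k for k, letter in enumerate(ALPHABET, start=1)}

# Illustrative matrix only (not a real scoring matrix): the entry in
# row i, column j is 10*i + j, so M_12 is 12, M_43 is 43, and so on.
M = [[10 * i + j for j in range(1, 5)] for i in range(1, 5)]


def entry(matrix, a, b):
    """Return the entry M_ij for letters a (row) and b (column)."""
    i, j = INDEX[a], INDEX[b]
    # Convert the 1-based indices of the text to 0-based list positions.
    return matrix[i - 1][j - 1]


if __name__ == "__main__":
    print(entry(M, "A", "C"))  # entry M_12 of the illustrative matrix
```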

 

Nucleotide scoring

Identity matrix (similarity)

   A  T  C  G
A  1  0  0  0
T  0  1  0  0
C  0  0  1  0
G  0  0  0  1

 

For the element in row i and column j:

$S_{ij} = 1$ if $i = j$
$S_{ij} = 0$ if $i \neq j$

BLAST matrix (similarity)

   A  T  C  G
A  5 -4 -4 -4
T -4  5 -4 -4
C -4 -4  5 -4
G -4 -4 -4  5
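
As a rough sketch of how these two similarity matrices are applied, the Python below scores an ungapped alignment by summing one matrix entry per aligned pair of bases. The function name ungapped_score and the two example sequences are assumptions made for illustration; gaps are not handled.

```python
ORDER = "ATCG"  # row and column order used in the matrices above

IDENTITY = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]

BLAST = [
    [5, -4, -4, -4],
    [-4, 5, -4, -4],
    [-4, -4, 5, -4],
    [-4, -4, -4, 5],
]


def ungapped_score(seq1, seq2, matrix, order=ORDER):
    """Sum the matrix entries for each aligned pair of bases (no gaps)."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    pos = {base: k for k, base in enumerate(order)}
    return sum(matrix[pos[a]][pos[b]] for a, b in zip(seq1, seq2))


if __name__ == "__main__":
    # Hypothetical example: 5 matching positions and 2 mismatches.
    s1, s2 = "GATTACA", "GACTATA"
    print(ungapped_score(s1, s2, IDENTITY))  # 5 matches * 1
    print(ungapped_score(s1, s2, BLAST))     # 5 * 5 + 2 * (-4)
```

With the identity matrix the score simply counts matches, while the BLAST matrix also penalizes each mismatch, so the same alignment scores 5 and 17 respectively.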

Transition/Transversion Matrix (distance)

   A  T  C  G
A  0  5  5  1
T  5  0  1  5
C  5  1  0  5
G  1  5  5  0


Nucleotide bases fall into two categories according to their ring structure. Purines (adenine and guanine) are two-ring bases; pyrimidines (cytosine and thymine) are single-ring bases. A point mutation in DNA replaces one base with another. A mutation that conserves the ring number is called a transition (e.g., A -> G or C -> T); a mutation that changes the ring number is called a transversion (e.g., A -> C or A -> T).

Although there are more ways to create a transversion (eight of the twelve possible base substitutions are transversions and only four are transitions), transitions are observed much more often in nature, i.e., when related DNA sequences are compared. Because transitions are more likely, it is sometimes desirable to use a weight matrix that takes this bias into account when comparing two DNA sequences.

Use of a Transition/Transversion Matrix reduces noise in comparisons of distantly related sequences.
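
The matrix above can be derived from the purine/pyrimidine classification rather than typed in by hand. The following Python is a minimal sketch under that assumption: the costs 0, 1 and 5 are taken from the matrix shown, while the function names substitution_cost, build_matrix and distance are hypothetical.

```python
PURINES = {"A", "G"}       # two-ring bases
PYRIMIDINES = {"C", "T"}   # single-ring bases


def substitution_cost(a, b, transition=1, transversion=5):
    """Cost of replacing base a with base b (0 if the bases are identical)."""
    for base in (a, b):
        if base not in PURINES and base not in PYRIMIDINES:
            raise ValueError(f"unknown base: {base!r}")
    if a == b:
        return 0
    # A transition keeps the ring number: purine <-> purine or
    # pyrimidine <-> pyrimidine. Anything else is a transversion.
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return transition if same_class else transversion


def build_matrix(order="ATCG"):
    """Build the full cost matrix in the given row/column order."""
    return [[substitution_cost(a, b) for b in order] for a in order]


def distance(seq1, seq2):
    """Sum of substitution costs over an ungapped alignment."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(substitution_cost(a, b) for a, b in zip(seq1, seq2))


if __name__ == "__main__":
    for row in build_matrix():
        print(row)
    # Hypothetical example: one transition (A -> G) and one transversion (G -> T).
    print(distance("ACGT", "GCTT"))
```

Because this is a distance rather than a similarity, lower totals indicate more closely related sequences.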

 

Protein scoring



Log odds matrices

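$S_{ij} = \log \frac{q_{ij}}{p_i p_j}$
$S_{ij}$ is the log odds ratio of two probabilities: $q_{ij}$, the probability that two residues i and j are aligned by evolutionary descent, and $p_i p_j$, the probability that they are aligned by chance, where $p_i$ and $p_j$ are the background frequencies of the two residues.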
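
A small numerical sketch may make the log odds score concrete. The text does not fix the base of the logarithm; base 2, which gives scores in bits, is assumed here. The function name log_odds and the frequencies in the example are invented for illustration and are not taken from any real matrix.

```python
import math


def log_odds(q_ij, p_i, p_j, base=2.0):
    """Log odds score for aligning residues i and j.

    q_ij: probability of seeing i aligned with j in related sequences.
    p_i, p_j: background frequencies of the two residues.
    base: logarithm base (2 gives bits; this choice is an assumption).
    """
    if min(q_ij, p_i, p_j) <= 0:
        raise ValueError("probabilities must be positive")
    return math.log(q_ij / (p_i * p_j)) / math.log(base)


if __name__ == "__main__":
    # Hypothetical frequencies: chance alone predicts 0.25 * 0.25 = 0.0625.
    print(log_odds(0.125, 0.25, 0.25))    # pair seen twice as often as chance
    print(log_odds(0.03125, 0.25, 0.25))  # pair seen half as often as chance
```

A pair that is aligned more often than chance predicts gets a positive score, and a pair that is aligned less often than chance gets a negative one.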

 



PAM Matrix

Summary of steps:

 



 

Properties of Mutation Probability Matrix



 

Assumptions in PAM model:

 

Sources of error in PAM model



BLOSUM (Blocks Substitution Matrix)

Steven Henikoff and Jorja G. Henikoff (1992). Amino acid substitution
matrices from protein blocks. Proc. Natl. Acad. Sci. 89: 10915-10919.

 



New Scoring Matrices



Selecting an optimal PAM matrix

PAM        H (bits)   Min. significant length
distance              (30 bits)

 10        3.43         9
 20        2.95        11
 30        2.57        12
 40        2.26        14
 50        2.00        15
 60        1.79        17
 70        1.60        19
 80        1.44        21
 90        1.30        24
100        1.18        26
110        1.08        28
120        0.98        31
130        0.90        34
140        0.82        37
150        0.76        40
160        0.70        43
170        0.65        47
180        0.60        51
190        0.55        55
200        0.51        59
210        0.48        63
220        0.45        68
230        0.42        73
240        0.39        78
250        0.36        83
260        0.34        89
270        0.32        91
280        0.30       100
290        0.28       107
300        0.27       113
310        0.25       120
320        0.24       127
330        0.22       134
340        0.21       141
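
The last column of the table can be approximated from the H column. Assuming that H is the average information per aligned position and that about 30 bits are needed for significance, as the column header suggests, the minimum length is roughly 30 / H rounded up. The sketch below uses a hypothetical function name, and both the rounding rule and the use of two-decimal H values are assumptions, so the result can differ by a residue or so from some tabulated entries.

```python
import math


def min_significant_length(h_bits, required_bits=30.0):
    """Approximate alignment length needed to accumulate required_bits.

    h_bits: average information per aligned position, in bits.
    required_bits: total information needed for significance (30 in the table).
    """
    if h_bits <= 0:
        raise ValueError("h_bits must be positive")
    return math.ceil(required_bits / h_bits)


if __name__ == "__main__":
    for h in (3.43, 2.00, 1.18):
        print(h, min_significant_length(h))
```

For example, an H of 3.43 bits per position gives a minimum length of 9, and an H of 2.00 gives 15.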




Other specialized scoring matrices



Last updated: 8 August 1997.
Created by: Fred Opperdoes