
Does anyone here know how to calculate a value for the asterisk (*) code that appears in substitution matrices?
From what I can see, every pair involving a single asterisk is given the lowest value in the matrix, and the asterisk-asterisk pair (**) is given 1. Example: BLOSUM62. But this is only my observation; I couldn't find any source or explanation of how these values were calculated, for BLOSUM62 or for any other substitution matrix.

I'd appreciate sources, especially scientific papers.

maciejwww
  • Can you please explain what you mean by "a value"? You already gave the values: 1 for another stop, and -4 for any amino acid in the example you linked to. What else do you need? – terdon Oct 15 '23 at 11:17
  • I want to know what's the formula / method behind these values / scores. – maciejwww Oct 15 '23 at 19:14
  • OK, in which matrix exactly, then? PAM? BLOSUM? Which PAM? Which BLOSUM? Do you know that the asterisk stands for a stop codon? – terdon Oct 15 '23 at 20:52
  • Ye, I heard about it. – maciejwww Oct 15 '23 at 23:51
  • 1
    BLOSUM would be the best, but as I couldn't find it for a long time, so I would appreciate any source. – maciejwww Oct 15 '23 at 23:53

1 Answer


It's really easy. No formula is needed.

Protein matrices are based on substitution frequencies observed in alignments of protein sequences from a variety of genes. They are not based on a priori formulas (unlike nucleotide scoring). Later matrices incorporated phylogenetic criteria using maximum likelihood, and how that works is really complicated.

BLOSUM used an odds ratio (OR), I think without a phylogenetic tree; I suspect it is a raw OR. Explaining the difference between building a matrix with and without a phylogenetic tree is a full lecture in phylogenetic theory, so just assume it is a raw value calculated directly from the alignments.

So let's just run with a raw OR-based matrix.
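To make "raw OR" concrete, here is a minimal sketch of the log-odds score as described in the Henikoff & Henikoff (1992) paper cited below: the observed pair frequency divided by the frequency expected by chance, taken as log base 2 and expressed in half-bit units (as for BLOSUM62). The function name and the toy frequencies are made up for illustration; they are not values from the paper.

```python
import math

def log_odds_score(q_ij, p_i, p_j, same):
    """Half-bit log-odds score: 2 * log2(observed / expected), rounded.

    q_ij : observed frequency of the aligned pair (i, j)
    p_i, p_j : background frequencies of residues i and j
    same : True for an identical pair (i == j)
    """
    # Expected pair frequency by chance: p_i^2 for identical residues,
    # 2 * p_i * p_j for different residues (either order counts).
    expected = p_i * p_j if same else 2 * p_i * p_j
    return round(2 * math.log2(q_ij / expected))

# Hypothetical toy frequencies, NOT taken from the BLOCKS database.
rare_pair = log_odds_score(0.0025, 0.1, 0.05, same=False)   # seen 4x less often than chance
conserved = log_odds_score(0.02, 0.05, 0.05, same=True)     # seen 8x more often than chance
print(rare_pair, conserved)
```

A pair that is observed less often than chance gets a negative score, and a pair observed more often than chance gets a positive one.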

The asterisk (*) is a stop codon. It carries the lowest weight in the matrix because each protein has only one of them. The frequency of a stop codon aligned against any amino acid is therefore going to be extremely low, because that aligned position contains only stop codons (i.e. no amino acids). Thus it is -4 here, the lowest value in the matrix.

What about the 1? When you compare stop codon frequency between proteins, its occurrence is always 1 to 1, right? Each protein has one stop codon, and the protein you compare it with has one stop codon too. So one million proteins that are nicely aligned as homologues (that's how it's calculated) have one million stop codons, and the value is 1 no matter how many proteins there are in the alignment. The alignment position is always the same because stop codons (at least in BLOSUM) are always homologous.

Once more: comparing a stop codon in one protein with the stop codon in another protein at the same homologous "amino acid" site will always give 1, no matter which protein it is, because it is a universal feature of all proteins.

That's how it works, and I would simply cite:

Henikoff, S.; Henikoff, J.G. (1992). "Amino Acid Substitution Matrices from Protein Blocks". PNAS. 89 (22): 10915–10919


I've checked the paper: the authors didn't use stop codons, nor did they use any phylogenetic criteria (that was easy to guess). I know they didn't include stop codons because:

  1. The summation (Sigma) in the matrix calculation has a 20 above it (i.e. 20 amino acids), so there's no allowance for stop codons.
  2. The matrices they present omit stop codons.

When NCBI use this matrix they need scores for all possibilities, so leaving out X (any amino acid) or * would not work for them. They could have approached the authors to fill in this information, which would probably have meant redoing the alignments to include the stop codon (something the authors wouldn't have liked), or NCBI could have added it after the event. It's a pretty safe addition, because the log-ratio of an invariant site is always 1. The authors/NCBI would just need to be sure the stop codon was in the same position for all protein alignments in BLOSUM.
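As an illustration only, the "added after the event" idea can be sketched as below. This is an assumption that reproduces the pattern noted in the question (lowest matrix value for any residue against *, and 1 for * against *); it is not a documented NCBI procedure. The helper name `add_stop_symbol` is hypothetical, and the matrix is a four-residue fragment of BLOSUM62 rather than the full table.

```python
def add_stop_symbol(matrix, residues, stop="*", stop_vs_stop=1):
    """Return a copy of `matrix` extended with a stop symbol.

    Assumed rule (not a documented procedure): residue-vs-stop gets the
    lowest score already present in the matrix; stop-vs-stop gets 1.
    """
    lowest = min(matrix.values())
    extended = dict(matrix)
    for r in residues:
        extended[(r, stop)] = lowest
        extended[(stop, r)] = lowest
    extended[(stop, stop)] = stop_vs_stop
    return extended

# Upper triangle of a small BLOSUM62 fragment (residues A, C, E, W).
residues = "ACEW"
upper = {
    ("A", "A"): 4, ("A", "C"): 0, ("A", "E"): -1, ("A", "W"): -3,
    ("C", "C"): 9, ("C", "E"): -4, ("C", "W"): -2,
    ("E", "E"): 5, ("E", "W"): -3,
    ("W", "W"): 11,
}

# Mirror the triangle so the fragment is a full symmetric matrix.
blosum62_fragment = {}
for (a, b), score in upper.items():
    blosum62_fragment[(a, b)] = score
    blosum62_fragment[(b, a)] = score

extended = add_stop_symbol(blosum62_fragment, residues)
print(extended[("A", "*")], extended[("*", "*")])
```

With this fragment, the lowest existing score is the C/E entry, so every residue-versus-stop cell takes that value.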

You might find the Wikipedia article more understandable: https://en.wikipedia.org/wiki/BLOSUM

M__