Randomness in BLAST

Question

The description of the BLAST parameters says:

"The Expect value (E) is a parameter that describes the number of hits one can "expect" to get by chance when searching a database of a particular size. It decreases exponentially as the score (S) increases. Essentially, the E value describes the random background noise. For example, an E value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size one might expect to see 1 match with a similar score simply by chance. The lower the E value, the better."

How can anything happen "BY CHANCE"? It's software, right?

Pallie · Answer 1 · 2019-02-18T09:18:08.300

The BLAST E-value is the expected number of false positives. It depends on your query size (the number of nucleotide or amino acid residues) and on the size of the database. With a short query and a large database, you are more likely to have a sequence in your database that matches your query by simple chance. In summary, the smaller the query and the larger the database, the greater the chance of a spurious result, and that is what E measures.

d_kennetz · Answer 2 · 2019-02-15T19:48:51.473

Sure. You start out with what is called a bitscore, which is a normalized version of the raw score S calculated from the alignment between the two sequences. It is given by the following equation (lambda and K are statistical parameters of the scoring system) and is independent of database size:

$$\text{bitscore} = \frac{\lambda S - \ln K}{\ln 2}$$

Then, the p-value of a local blast is just:

$$\text{p-value} = \frac{1}{2^{\text{bitscore}}}$$

So if your bitscore is 15, the p-value is 1/2^15 = 1/32768: you would need about 32,768 random alignments before you get a score as good or better (highly similar sequences) BY CHANCE ALONE. This addresses your "BY CHANCE" question.
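To make the arithmetic concrete, here is a minimal Python sketch of the two formulas above. The function names are my own, and the `lam` and `k` values are illustrative placeholders only; the real values depend on the scoring matrix and gap penalties and are reported by BLAST itself.

```python
import math

def bit_score(raw_score, lam, k):
    """Normalize a raw alignment score S to a bitscore: (lambda*S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def chance_probability(bits):
    """Per-alignment chance probability as described above: 1 / 2^bitscore."""
    return 2.0 ** -bits

# Placeholder parameters (assumptions for illustration, not taken from a real run).
lam, k = 0.267, 0.041

bits = bit_score(50, lam, k)
print(f"raw score 50 -> bitscore {bits:.1f}")
print(f"bitscore 15 -> probability {chance_probability(15):.3e}")
```

Note that the database size appears nowhere in this calculation, which is the sense in which the bitscore is independent of it.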

Earlier I said the bitscore was independent of database size. The E value is just the p-value above normalized to the database size (so it is dependent on the db size) by the following equation:

$$E = \text{query length} \times \text{database length} \times \text{p-value}$$

which simplifies a bit to:

$$E = \frac{\text{query length} \times \text{database length}}{2^{\text{bitscore}}}$$
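As a rough sketch of that last formula in Python (the function name and the lengths below are hypothetical, chosen only to show the scaling; as far as I know, real BLAST also applies length corrections to the query and database, which this ignores):

```python
def e_value(query_len, db_len, bits):
    """E = query length * database length / 2^bitscore."""
    return query_len * db_len * 2.0 ** -bits

# Same query and same bitscore, searched against databases of different sizes.
# All numbers here are made up for illustration.
query_len = 100
bits = 40
for db_len in (1_000_000, 1_000_000_000):
    print(f"db length {db_len:>13,}: E = {e_value(query_len, db_len, bits):.3e}")
```

The same alignment gets a larger E in the larger database, even though its bitscore has not changed.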

I hope this helps!

M__ · Answer 3 · 2019-02-15T12:03:35.940


How E is calculated for BLAST is a good question.

Two methods:

1. Use a Poisson or binomial distribution.
2. Use randomization.

I think it's no. 1, because no. 2 would use too much computational power. I did know, but someone borrowed my O'Reilly book on the subject.
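For what it's worth, here is a small sketch of what no. 1 would look like, assuming the number of chance hits follows a Poisson distribution with mean E. Under that assumption, the probability of seeing at least one chance hit is 1 - exp(-E). The function name is my own, and this only illustrates the distribution; it is not BLAST's actual code.

```python
import math

def prob_at_least_one_hit(e_value):
    """P(at least one chance hit) = 1 - exp(-E), assuming a Poisson count with mean E."""
    # expm1(x) computes exp(x) - 1 accurately for small x, so negate it.
    return -math.expm1(-e_value)

for e in (0.001, 0.1, 1.0, 10.0):
    print(f"E = {e:>6}: P(at least one chance hit) = {prob_at_least_one_hit(e):.4f}")
```

For small E the probability is almost equal to E itself, so the two numbers are often used interchangeably in that range.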

One of the contributors here will know, because I suspect they're genomics developers.

edited Feb 15 '19 at 12:03

answered Feb 14 '19 at 22:21
