6

The BLAST documentation describes the E value like this:

"The Expected value E is a parameter that describes the number of hits one can "expect" to get by chance when searching a database of a particular size. It decreases exponentially as the score (S) increases. Essentially, the E value describes the random background noise. For example, an E value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size one might expect to see 1 match with a similar score simply by chance. The lower the E value, the better."

How can anything happen "BY CHANCE"? It's software, right?

user37060

3 Answers

7

The BLAST E value is the expected number of false positives, i.e. hits that score this well purely by chance. It depends on your query size (number of nucleotide or amino acid residues) and the size of the database. With a short query and a large database, you are more likely to have a sequence in the database that matches your query by simple chance. In summary, the smaller the query and the larger the database, the greater the chance of a spurious result, and that is what E is measuring.
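To make the database-size effect concrete, here is a minimal sketch using the standard Karlin-Altschul estimate E = K * m * n * exp(-lambda * S), where m is the query length and n the database length. The function name and the default values of lambda and K are assumptions chosen for illustration (they are typical of a gapped BLOSUM62 protein search); BLAST reports the actual values at the end of each run.

```python
import math

def expected_hits(raw_score, query_len, db_len, lam=0.267, k=0.041):
    """Karlin-Altschul estimate E = K * m * n * exp(-lambda * S).

    lam and k are illustrative defaults, not values read from a real search.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)

# Same query and same raw score, searched against two database sizes (in residues).
small_db = expected_hits(raw_score=60, query_len=100, db_len=1_000_000)
large_db = expected_hits(raw_score=60, query_len=100, db_len=100_000_000)

print(f"E against 1e6 residues: {small_db:.3g}")
print(f"E against 1e8 residues: {large_db:.3g}")
```

The sketch holds the score fixed and varies only the database size: a database 100 times larger gives an E value 100 times larger for the very same alignment.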

Pallie
5

Sure. You start with what is called a bitscore, which is a normalized form of the raw score S calculated from the alignment between the two sequences. It is given by the following equation, where lambda and K are statistical parameters of the scoring system, and it is independent of database size:

bitscore = (lambda * S - ln(K)) / ln(2)

Then, the p-value of a local BLAST alignment is just:

1/2^bitscore

So if your bitscore is 15, the p-value is 1/2^15 = 1/32768: you would expect to go through roughly 32,768 random alignments before seeing one that scores as well or better (highly similar sequences) BY CHANCE ALONE. This addresses your "BY CHANCE" question.
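The arithmetic above can be sketched in a few lines of Python. The function names and the default lambda and K are assumptions for illustration only (typical values for a gapped BLOSUM62 search), not something taken from a particular BLAST run.

```python
import math

def bit_score(raw_score, lam=0.267, k=0.041):
    """Normalize a raw alignment score: (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def chance_probability(bits):
    """Per-alignment chance probability under this answer's model: 1 / 2^bitscore."""
    return 2.0 ** -bits

p = chance_probability(15)
print(p)             # probability for a bitscore of 15
print(round(1 / p))  # number of random alignments expected before one such score

# A higher raw score gives a higher bitscore and a smaller chance probability.
print(bit_score(50), bit_score(60))
```

Each extra bit halves the probability, which is why bitscores from searches with different scoring matrices can be compared directly.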

Earlier I said the bitscore was independent of database size. The E value is just the p-value above scaled by the size of the search space (so it does depend on the database size), using the following equation:

E = query length * database length * p-value

which simplifies a bit to:

E = (query length * database length) / 2^bitscore
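As a rough sketch of that last formula, assuming lengths are counted in residues and using a hypothetical helper name `e_value`:

```python
def e_value(bits, query_len, db_len):
    """E = (query length * database length) / 2^bitscore."""
    return query_len * db_len / 2.0 ** bits

# A 100-residue query against a database of one billion residues.
e_40 = e_value(bits=40, query_len=100, db_len=10**9)
e_41 = e_value(bits=41, query_len=100, db_len=10**9)
e_40_big = e_value(bits=40, query_len=100, db_len=2 * 10**9)

print(f"bitscore 40: E = {e_40:.3g}")
print(f"bitscore 41: E = {e_41:.3g}")          # one more bit halves E
print(f"doubled database: E = {e_40_big:.3g}")  # twice the database doubles E
```

Note that real BLAST uses "effective" query and database lengths that are slightly shorter than the raw lengths, so its reported E values will not match this sketch exactly.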

I hope this helps!

d_kennetz
0

How E is calculated for BLAST is a good question.

Two methods:

  • Use a Poisson or binomial distribution.
  • Use randomization.

I think it's no. 1, because no. 2 would use too much computational power. I did know, but someone borrowed my O'Reilly book on the subject.
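For what it's worth, the published BLAST statistics do follow the first route: the number of chance hits scoring at least S is modeled as Poisson-distributed with mean E, so the probability of seeing at least one such hit is 1 - exp(-E). A minimal sketch, with a hypothetical function name:

```python
import math

def p_at_least_one(e_value):
    """P(at least one chance hit) = 1 - exp(-E) under a Poisson model with mean E."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate when E is tiny.
    return -math.expm1(-e_value)

for e in (10.0, 1.0, 0.01, 1e-10):
    print(f"E = {e:g} -> P = {p_at_least_one(e):.3g}")
```

For small E the two numbers are nearly identical (P is approximately E), which is why BLAST can report E alone without losing information.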

One of the contributors here will know, because I suspect they're genomics developers.

M__