
I'm currently trying to evaluate a model of metabolism which aims to predict whether deleting individual genes will cause a growth defect (there are ~850 genes in total). I know from experimental data which genes show slow growth, so I'm mainly judging the model on what percentage of genes it correctly predicts. There are only two possible predictions for each gene, normal growth or reduced growth.

To try and prove the model's effectiveness (or otherwise), I'd like to compare it to a null hypothesis of "genes predicted at random". However, I'm not sure what the best form of this would be, or even if it's a sensible question.

A couple of possibilities that occurred to me were:

  • The same number of growth defective genes are chosen, but they are assigned at random. (For example, if the model predicts 10 particular genes cause a growth defect if deleted but the rest show normal growth, the null hypothesis is that this is indistinguishable from picking ten genes at random)

  • The number of genes causing defects is chosen at random, then they are distributed randomly as above.

The first one seems to use too much information for a completely random prediction, but the second could show high significance even for poor predictions, so I'm not quite sure how to proceed...
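For concreteness, here is roughly how I imagine simulating both options (the counts of experimentally slow-growing and predicted genes below are made-up placeholders, just to illustrate the idea):

```python
import numpy as np

rng = np.random.default_rng(0)

n_genes = 850           # total genes in the model
n_true_defective = 120  # genes that show slow growth experimentally -- made-up number
n_predicted = 100       # genes the model calls defective -- made-up number

# Experimental labels: 1 = slow growth, 0 = normal growth.
truth = np.zeros(n_genes, dtype=int)
truth[:n_true_defective] = 1

def accuracy(predicted_defective_idx):
    """Fraction of genes whose predicted label matches the experimental label."""
    pred = np.zeros(n_genes, dtype=int)
    pred[predicted_defective_idx] = 1
    return np.mean(pred == truth)

n_sims = 10_000

# Null 1: the number of "defective" calls is fixed at the model's count,
# but the genes receiving them are chosen uniformly at random.
null1 = np.array([
    accuracy(rng.choice(n_genes, size=n_predicted, replace=False))
    for _ in range(n_sims)
])

# Null 2: the number of "defective" calls is itself random
# (here uniform on 0..n_genes, purely for illustration), then assigned at random.
null2 = np.array([
    accuracy(rng.choice(n_genes, size=rng.integers(0, n_genes + 1), replace=False))
    for _ in range(n_sims)
])

# The model's observed accuracy would then be compared against either
# null distribution, e.g. p_value = np.mean(null1 >= observed_accuracy).
print(null1.mean(), null1.std(), null2.mean(), null2.std())
```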

3 Answers


You should think about drawing some Receiver Operating Characteristic (ROC) curves and calculating the Area Under the Curve (AUC), also known as the c-statistic. I'm guessing only a small number of genes cause the defect, and that there's some kind of threshold you can vary in your classification model; ROC/AUC is a useful methodology for measuring the performance of a classifier in this kind of situation. The AUC for a 'chance' predictor is 0.5.

If you do this, you should be aware of the methodology's limitations; there's a good paper by Cook (2007) in the journal "Circulation" on this. It's also a good idea to bootstrap the AUC statistic to check whether it is genuinely better than chance.
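As a rough sketch of how that might look, assuming your model produces some continuous score per gene that you threshold, and that scikit-learn is available (the labels and scores below are random placeholders):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# y_true: experimental labels (1 = slow growth), y_score: the model's continuous
# score per gene before thresholding -- both are placeholders here.
n_genes = 850
y_true = rng.integers(0, 2, size=n_genes)
y_score = rng.random(n_genes)

auc = roc_auc_score(y_true, y_score)

# Bootstrap the AUC by resampling genes with replacement.
n_boot = 2000
boot_aucs = []
for _ in range(n_boot):
    idx = rng.integers(0, n_genes, size=n_genes)
    # Skip resamples that contain only one class (AUC is undefined there).
    if len(np.unique(y_true[idx])) < 2:
        continue
    boot_aucs.append(roc_auc_score(y_true[idx], y_score[idx]))

lo, hi = np.percentile(boot_aucs, [2.5, 97.5])
print(f"AUC = {auc:.3f}, 95% bootstrap CI = ({lo:.3f}, {hi:.3f})")
# If the interval excludes 0.5, the classifier is doing better than chance.
```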


The appropriate notion of randomness in your null hypothesis should depend on your prediction model. If your model always predicts a fixed number of growth defective genes, then your first proposal is reasonable. However, if the number varies, you may want to model that variation and replicate it in your null hypothesis. If it is some complicated stochastic process, you could, for instance, estimate the average number of predicted growth defective genes and its variance, and then choose a notion of randomness for the null hypothesis that matches these first and second moments.

In particular, since the number of growth defective genes is a discrete, non-negative count, a natural choice is the binomial distribution. If all genes have an identical probability $p$ of being called defective, you can think of looking at each gene in turn, flipping a biased coin that comes up heads with probability $p$, and labeling the gene as growth defective only if the flip was heads; the total number of defective calls is then binomial. Alternatively, if the genes are not identical, you can give each gene its own probability $p_i$ (the total then follows a Poisson binomial distribution; for a related problem, see the coupon collector problem).

EDIT: Since the binomial distribution ties the variance to the mean (it has only one free parameter), the hypergeometric distribution might be more appropriate; it describes how many truly defective genes you hit when a fixed number of genes is picked at random without replacement, as in your first proposal.
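To illustrate, a quick sketch of both count models using NumPy (the gene counts and the number of predicted genes are made-up placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

n_genes = 850
n_true_defective = 120   # experimentally slow-growing genes -- made-up number
n_predicted = 100        # calls the model typically makes -- made-up number
n_sims = 10_000

# Binomial null: every gene is called defective independently with probability p,
# chosen so the expected number of calls matches the model's average.
p = n_predicted / n_genes
binomial_counts = rng.binomial(n_genes, p, size=n_sims)

# Hypergeometric null: exactly n_predicted genes are called defective, chosen at
# random without replacement; this counts how many of them are truly defective.
true_positives = rng.hypergeometric(
    ngood=n_true_defective,
    nbad=n_genes - n_true_defective,
    nsample=n_predicted,
    size=n_sims,
)

print("binomial call counts:  mean %.1f, sd %.1f" %
      (binomial_counts.mean(), binomial_counts.std()))
print("random true positives: mean %.1f, sd %.1f" %
      (true_positives.mean(), true_positives.std()))
# The hypergeometric mean and spread give the chance level to beat when the
# number of predicted genes is held fixed (the first proposal in the question).
```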


You have given few details about your metabolic model; if it happens to be fitted to some experimental data, you can use a nonparametric approach, i.e. compare your true model to models built with the original methodology but given no real information about your data.

It might work like this: first build a set of models on deliberately corrupted copies of the original data (the details depend on the data; you could, for instance, shuffle it), extract their predictions, and use them to obtain the null distribution of whatever performance measure you use.
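A rough sketch of the idea, where `fit_and_score` is a hypothetical placeholder for your actual fitting-and-evaluation pipeline and the data are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_and_score(features, labels):
    """Placeholder for your real pipeline: fit the data-driven parts of the
    metabolic model to (features, labels) and return the performance measure
    you care about, e.g. the percentage of genes predicted correctly."""
    # Hypothetical stand-in so the sketch runs: call a gene defective when its
    # first feature is above the median, then report the fraction of correct calls.
    pred = (features[:, 0] > np.median(features[:, 0])).astype(int)
    return np.mean(pred == labels)

# Placeholder data: per-gene features and experimental slow-growth labels.
n_genes = 850
features = rng.random((n_genes, 10))
labels = rng.integers(0, 2, size=n_genes)

observed = fit_and_score(features, labels)

# Null distribution: refit on copies of the data whose labels have been shuffled,
# so the model has no real information about which genes are slow-growing.
n_null = 1000
null_scores = np.array([
    fit_and_score(features, rng.permutation(labels))
    for _ in range(n_null)
])

p_value = np.mean(null_scores >= observed)
print(f"observed = {observed:.3f}, null mean = {null_scores.mean():.3f}, p = {p_value:.3f}")
```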