3

I calculated an 80% prediction interval of the outcome of interest (proportion of patients, average temperature etc.) based on my previous study. Is it true that in the future study I will receive the outcome within the calculated prediction interval with 80% probability? Which means that if I do a billion studies the outcome will be within the prediction interval in about 80% of them?

Oleg
  • 41

4 Answers

2

The answer is no: no such (nearly) free lunch can exist, even if the model is perfect.

Consider for instance a linear regression using $n$ observations $[x_i,\,Y_i]$. Given a "new" design point $x^\star$ you can compute an $80\%$ prediction interval; it contains one new $Y^\star$ corresponding to $x^\star$ with probability $80\%$, where the probability is taken jointly over the original observations and the new value. But it does not contain $80\%$ of a large number of new values $Y^\star_j$ made at $x^\star$, even if the model is perfectly adequate. As suggested by the formulation of your question, you cannot guess exactly the distribution of an infinite number of new values from a finite number of observations.
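
To make the distinction explicit, here is a short sketch under assumptions that are not stated above: normal errors with constant standard deviation $\sigma$, a true mean $\mu^\star$ of $Y^\star$ at $x^\star$, and $[L,\,U]$ denoting the interval computed from the $n$ observations. The symbols $\sigma$, $\mu^\star$, $L$, $U$ and $C$ are introduced here only for illustration. Once the data are fixed, the chance that a new value falls in the interval is

$$
C(L,U) \;=\; \Pr\left(L \le Y^\star \le U \,\middle|\, L,\,U\right) \;=\; \Phi\!\left(\frac{U-\mu^\star}{\sigma}\right) - \Phi\!\left(\frac{L-\mu^\star}{\sigma}\right),
\qquad
\mathbb{E}\left[\,C(L,U)\,\right] \;=\; 0.8 ,
$$

where $\Phi$ is the standard normal distribution function. The quantity $C(L,U)$ is random because $L$ and $U$ depend on the sample; only its average over datasets equals $80\%$. For one given dataset it can be larger or smaller, and it is the long-run fraction of new values $Y^\star_j$ that fall in that one fixed interval.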

To reach the desired coverage, you have to draw a new dataset with observations $[x_i, \, Y_i]$ for each prediction that you make. This is well explained by G.W. Snedecor and W.G. Cochran in the regression chapter of their famous Statistical Methods book.

An alternative where the expected coverage rate holds is when the prediction is updated sequentially, thus modifying the prediction interval: the first new couple $[x^\star_j, \,Y^\star_j]$ is included in the data, then the model estimates and the prediction interval are updated before a new prediction is made. Again, this must be repeated for each prediction, and the coverage will be reached only in the long run. This context is classical in time-series analysis.

Yves
  • 5,358
1

There are already several answers to this question but on my reading, they are all incorrect (or easily misinterpreted).

A frequentist prediction interval works like a frequentist confidence interval. The "promise" of a frequentist confidence interval holds on repeated samples, e.g. if I construct a 95 percent confidence interval then on repeated samples 95 percent of the intervals will contain the parameter value.

Any particular interval will either contain the parameter value or not though. So once you've constructed the interval, there is no guarantee, for example, that the parameter falls in the interval with a certain probability.

This is also how a prediction interval works. You 1) sample a distribution, 2) construct an 80 percent prediction interval and then 3) sample the distribution again. Now if you repeat steps 1) to 3) thousands of times, the outcome from step 3) will fall in the prediction interval from step 2) 80 percent of the time.

However, if you follow steps 1) and 2) only once and then repeat step 3) thousands of times, the prediction interval from step 2) will almost certainly not cover the outcome from step 3) 80 percent of the time.
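
The two procedures above can be compared in a small simulation. This is only a sketch under assumptions of my own choosing: the data are standard normal, and the interval is the distribution-free one running from the minimum to the maximum of $n = 9$ observations, which covers a new draw with probability $(9-1)/(9+1) = 80\%$ exactly. It is not the method used in the question; the names `interval`, `marginal` and `conditional` are hypothetical.

```python
import random
from statistics import NormalDist

rng = random.Random(0)          # fixed seed so the run is reproducible
norm = NormalDist(0.0, 1.0)     # assumed true distribution (illustration only)
n = 9                           # [min, max] of 9 iid draws is an exact 80% PI

def interval(rng):
    """Steps 1) and 2): draw a sample and build the prediction interval."""
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return min(sample), max(sample)

# Repeat steps 1) to 3) many times: new sample, new interval, one new outcome.
reps = 100_000
hits = 0
for _ in range(reps):
    lo, hi = interval(rng)
    y_new = rng.gauss(0.0, 1.0)
    hits += lo <= y_new <= hi
marginal = hits / reps

# Do steps 1) and 2) once, then imagine repeating step 3) forever:
# the long-run hit rate of that one interval is its exact probability content.
conditional = []
for _ in range(5):
    lo, hi = interval(rng)
    conditional.append(norm.cdf(hi) - norm.cdf(lo))

print(f"coverage when steps 1-3 are all repeated: {marginal:.3f}")
for c in conditional:
    print(f"coverage of one fixed interval: {c:.3f}")
```

The first figure should come out close to 0.80, while the coverage of each fixed interval varies from one dataset to the next, which is the point of the last two paragraphs.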

So to the question,

I calculated an 80% prediction interval of the outcome of interest (proportion of patients, average temperature etc.) based on my previous study. Is it true that in the future study I will receive the outcome within the calculated prediction interval with 80% probability?

the answer is no.

num_39
  • 1,454
0

It is probably not true. In an ideal world, where observations are truly i.i.d. and your first study had a truly representative sample and all the other assumptions (distributions etc.) were true, then yes. However, in real life those assumptions fail to hold, and thus you will probably face larger variances than expected.

How much this matters differs between fields. It may be possible to do the same experiment twice in physics, but then again, probably only in fields that rarely use statistics. In fields like medicine, psychology and sociology, the idea of a second sample that is distributed identically to the first remains a dream most of the time.

Bernhard
  • 8,427
  • 17
  • 38
  • Thanks for your answer. I am just interested in the statistical point of view of this problem, not representativeness and so on. – Oleg May 30 '18 at 15:10
  • "In an ideal world, where observations are truly i.i.d. and your first study had a truly representative sample and all the other assumptions (distributions etc.) were true, then yes." This is not how a prediction interval works. You would instead need to sample one billion times and predict based on each of these samples, one billion times. – num_39 Mar 22 '23 at 12:40
0

Thanks for your answers. According to the citation below (from the book Statistical Intervals: A Guide for Practitioners and Researchers), the answer is yes.

Suppose that, based on the x = 20 nonconforming integrated circuits from n = 1,000 randomly selected units, the manufacturer desires a 95% prediction interval to contain the number of nonconforming units in a future sample of m = 1,000 randomly sampled units from the same production process. The conservative method given by (6.9) gives $[\underline{Y},\,\overline{Y}] = [9, 35]$ for a conservative 95% prediction interval and $\overline{Y} = 32$ for an upper conservative 95% prediction bound for Y. Thus, based on x = 20 nonconforming units from the n = 1,000 sample units, one can, for example, assert, with (at least) 95% confidence, that the number of nonconforming units in the future sample of m = 1,000 will not exceed 32 units.

Oleg
  • 41
  • The answer does not correspond to the question. If you take $N^\star:= $ a billion random samples of $n=1000$ selected units each and evaluate the number of nonconforming units $Y_i^\star$ for each sample $i$, it will not be true that $95\%$ of the $N^\star$ values $Y_i^\star$ fall in your prediction interval. The fact that $n=1000$ may seem large is of no importance here. You can try this by simulating a binomial distribution. – Yves May 31 '18 at 05:43
  • I thought that "confidence" for the future sample is equivalent to the proportion of future infinite samples in which the outcome is within the prediction interval. – Oleg May 31 '18 at 07:14
  • The expression "with 95% confidence" is somewhat misleading here because a confidence interval is for an unknown non-random quantity. Yet 'in the future sample' does not mean 'across a large number of future samples'. – Yves May 31 '18 at 09:37
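
The binomial check suggested in the comments can be sketched as follows. The interval $[9, 35]$ and the future sample size $m = 1000$ are taken from the quoted example; the candidate values of the true nonconforming proportion `p` are hypothetical, chosen only because each is plausible after observing 20 out of 1,000. The code does not reproduce the book's method (6.9); it only computes, exactly, how often a future count would land in that one fixed interval for a given `p`.

```python
from math import comb

def binom_pmf(k, n, p):
    """Exact Binomial(n, p) probability of observing k."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def coverage(p, lo=9, hi=35, m=1000):
    """P(lo <= Y <= hi) for Y ~ Binomial(m, p): the long-run fraction of
    future samples whose count falls inside the fixed interval [lo, hi]."""
    return sum(binom_pmf(k, m, p) for k in range(lo, hi + 1))

# Hypothetical true proportions; the real one is unknown to the manufacturer.
for p in (0.012, 0.02, 0.03):
    print(f"p = {p:.3f}: P(9 <= Y <= 35) = {coverage(p):.3f}")
```

If the true proportion happens to equal the observed 2%, the fixed interval covers more than 95% of future samples; if it is somewhat lower or higher, the long-run coverage drops below 95%. The 95% statement is therefore about the whole procedure, averaged over the first sample as well, and not about the proportion of future samples that fall in one computed interval.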