
I was thinking of this problem, and I'm not sure if I'm right with this approach.

$X$ is a random variable with unknown distribution, bounded to the interval $[a,b]$, with $a < b$ and both finite. Suppose I take a very large sample from this population, split it into training and test datasets (the latter with $n$ samples), and train a model (keeping things general, because I don't want to use any property of it) to make predictions for future samples. What is the prediction interval (not the confidence interval) for a single data point?

I thought that, because $X$ is bounded, the maximum variance of the population is $(b-a)^2/4$, so the $(1-\alpha)100\%$ prediction interval for a data point wouldn't be larger than: $$ \left(\hat{x} - t_{1-\alpha/2,\,n-1}\cdot\frac{b-a}{2}\sqrt{\frac{n+1}{n}},\ \hat{x} + t_{1-\alpha/2,\,n-1}\cdot\frac{b-a}{2}\sqrt{\frac{n+1}{n}}\right) $$
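To see how conservative this worst-case bound is in practice, here is a minimal sketch (the function name `conservative_pi` is mine, and I use the normal quantile as a large-$n$ stand-in for $t_{1-\alpha/2,\,n-1}$, since the standard library has no $t$ quantile):

```python
import math
from statistics import NormalDist

def conservative_pi(xbar, a, b, n, alpha=0.05):
    """Worst-case prediction interval for a variable bounded on [a, b],
    plugging the maximum possible standard deviation (b - a)/2 into the
    normal-theory prediction-interval formula.  The normal quantile
    approximates t_{1-alpha/2, n-1} for large n."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * (b - a) / 2 * math.sqrt((n + 1) / n)
    # Since X cannot fall outside [a, b], the interval can be clipped.
    return max(a, xbar - half), min(b, xbar + half)

# With a = 0, b = 1 the half-width is about 0.98 even for large n,
# so the clipped 95% interval is essentially all of [0, 1].
lo, hi = conservative_pi(0.5, 0, 1, n=1000)
```

This illustrates that the worst-case-variance bound is nearly vacuous: at the 95% level the half-width exceeds $(b-a)/2$, so the interval covers the whole support.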

That's basically the equation for the prediction interval when $X$ is normal, but can I use it even if $X$ isn't?

Should I assume a distribution for the population ?

Maybe I could make tighter bounds using confidence intervals for the sample variance, but I wanted to keep things simple. I would also appreciate any references on non-parametric prediction intervals.
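Regarding non-parametric prediction intervals, one classical construction uses order statistics (the same idea underlies split conformal prediction): for an i.i.d. sample of size $n$, a future draw falls between the $k$-th and $(n+1-k)$-th order statistics with probability exactly $1 - 2k/(n+1)$, with no distributional assumptions. A minimal sketch (the function name `orderstat_pi` is mine):

```python
import math

def orderstat_pi(sample, alpha=0.05):
    """Distribution-free prediction interval from order statistics.
    For an i.i.d. sample of size n, a new draw lies between the k-th
    and (n+1-k)-th order statistics with probability 1 - 2k/(n+1),
    so we pick the largest k with 2k/(n+1) <= alpha."""
    xs = sorted(sample)
    n = len(xs)
    k = math.floor(alpha * (n + 1) / 2)
    if k < 1:
        raise ValueError("sample too small for this coverage level")
    return xs[k - 1], xs[n - k]  # k-th and (n+1-k)-th order statistics
```

For example, with $n = 100$ and $\alpha = 0.05$ this gives $k = 2$, i.e. the interval between the 2nd-smallest and 2nd-largest observations, with guaranteed coverage $1 - 4/101 \approx 96\%$.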

1 Answer


In general, you can't say.

If all you know is that $X$ lives on a bounded interval $[a,b]$, then $X$ could be a point mass on $a$ (or on $b$), and your prediction interval would contain only a single point.

Alternatively, $X$ could have point masses on both $a$ and $b$, say with equal probability. Then your prediction intervals will all consist of the entire interval $[a,b]$.

If you dislike point masses, you can always use densities proportional to $x^n$ for $n$ large enough on the interval $[a,b]=[0,1]$ to get prediction intervals of the form $[1-\epsilon,1]$ for arbitrarily small $\epsilon$.


To clarify the last paragraph: some people consider using point masses for counterexamples a kind of "cheating". So we can also consider a random variable $X_n$ (indexed by $n$) with density $f_n(x)=(n+1)x^n$ on the interval $[0,1]$. A little integration shows that $$ EX_n = \int_0^1 xf_n(x)\,dx = \frac{n+1}{n+2} $$ and that $$ \int_c^1 f_n(x)\,dx = q \quad\iff\quad c=\sqrt[n+1]{1-q}. $$ Thus, a (say) 95% prediction interval can be given in the form $[\sqrt[n+1]{0.05},1]$, and by increasing $n$, this can be made as short as we want: $[1-\epsilon,1]$ for any small $\epsilon$. Conversely, you could use a small $n$ for wider prediction intervals, or concentrate the mass near $0$ instead of $1$ by considering $f_n(1-x)$ instead of $f_n(x)$.
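This construction is easy to check numerically (the function name `pi_lower` is mine): integrating the density gives mass $1 - c^{\,n+1}$ above $c$, so the lower endpoint of the one-sided interval $[c,1]$ with coverage $q$ is $c = (1-q)^{1/(n+1)}$, which tends to $1$ as $n$ grows.

```python
def pi_lower(n, q=0.95):
    """Lower endpoint c of the one-sided q-prediction interval [c, 1]
    for the density f_n(x) = (n+1) * x**n on [0, 1]:
    the mass above c is 1 - c**(n+1) = q, hence c = (1-q)**(1/(n+1))."""
    return (1 - q) ** (1 / (n + 1))

# The interval [c, 1] shrinks toward {1} as n grows:
# pi_lower(1) is about 0.224, pi_lower(99) is about 0.970
```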

Of course, "the" prediction interval does not exist. If you additionally want symmetry around the expectation of $\frac{n+1}{n+2}$, you need to do a little more algebra.

Stephan Kolassa
  • I didn't understand this part: "If you dislike point masses, you can always use densities proportional to $x^n$ for $n$ large enough on the interval $[a,b]=[0,1]$ to get prediction intervals of the form $[1-\epsilon,1]$ for arbitrarily small $\epsilon$" – Fernando Oct 20 '21 at 19:19