4

I am looking for pointers/advice on producing interval estimates (as opposed to point estimates), assuming the noise on my data is not constant.

To make it simpler, let's assume the following linear model and hypotheses.

$$ y = \theta \cdot \mathbf{x} + \sigma(\mathbf{x})\,\epsilon $$

where $y$ is our target scalar random variable, $\mathbf{x}$ is our feature vector, $\epsilon$ is standard Gaussian noise, and the noise scale $\sigma$ depends on $\mathbf{x}$.

I want to avoid assuming that $\sigma(\mathbf{x})$ is constant, so that I can give tighter interval estimates where my model has good reason to believe it is accurate.
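To make the setup concrete, here is a minimal simulation of the model above; the scalar feature and the particular linear form of $\sigma(x)$ are purely illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 500
x = rng.uniform(0, 10, size=n)              # scalar feature for simplicity
theta = 2.0                                 # true coefficient
sigma = 0.1 + 0.3 * x                       # illustrative noise scale sigma(x)
y = theta * x + sigma * rng.normal(size=n)  # heteroscedastic responses
```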

I don't think using the bootstrap error really addresses this problem, as it captures the model variance rather than the data noise.

  • Is $\sigma(x)$ known or must it be estimated? If estimated, is the functional form of $\sigma(x)$ known, or not? – Glen_b Aug 22 '15 at 04:01

3 Answers

8

Is there a reason why you don't just use Weighted Least Squares, with weight matrix $= \operatorname{diag}\left(1/\sigma(x_i)^2\right)$? I am presuming $\sigma(x)$ is actually the standard deviation, not the variance. See the link in the comments below.
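If the $\sigma(x_i)$ are known (or can be estimated), a minimal sketch of this with statsmodels might look like the following; the simulated data and the linear form of $\sigma(x)$ are only placeholders:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)            # placeholder feature
sigma = 0.1 + 0.3 * x                       # sigma(x_i), assumed known here
y = 2.0 * x + sigma * rng.normal(size=200)  # heteroscedastic responses

X = sm.add_constant(x)                      # design matrix with intercept
fit = sm.WLS(y, X, weights=1.0 / sigma**2).fit()

print(fit.params)                           # point estimates of the coefficients
print(fit.conf_int(alpha=0.05))             # 95% confidence intervals
```

For a prediction interval at a new $x_0$, the noise contributes $\sigma(x_0)$ directly, so the interval is roughly $\hat\theta \cdot x_0 \pm z_{1-\alpha/2}\sqrt{\operatorname{Var}(\hat\theta \cdot x_0) + \sigma(x_0)^2}$, which is exactly what gives tighter intervals where $\sigma(x_0)$ is small.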

Alternatively, if you believe you can ascribe an interval to each value of $y$ which definitively bounds its possible value, with the interval widths not necessarily being the same for different observations, you could employ an interval-arithmetic least squares algorithm for interval-valued responses - see

this book and the references therein, and

http://www.nsc.ru/interval/Library/Thematic/DataProcs/MarkovILS.pdf (ancient and Eastern European).

These are all based on exactly known values of x, but interval values of y.

See comments below for additional links which I was not allowed to include in this post.

Mark L. Stone
  • Or http://dspace.sheol.uniovi.es/dspace/bitstream/10651/10485/1/A%20linear%20regression%20model...10000484-main.pdf for a fuzzy response variant. – Mark L. Stone Jun 13 '15 at 03:25
  • Weighted Least Squares per https://en.wikipedia.org/wiki/Linear_least_squares_%28mathematics%29#Weighted_linear_least_squares . – Mark L. Stone Jun 13 '15 at 03:26
4

You say:

I don't think using bootstrap error really address this problem as it captures the model variance rather than the data noise.

However, it does! Your understanding of the bootstrap is mistaken.

The bootstrap is a completely non-parametric method of estimating a parameter's error. Here non-parametric means that the concept of a "model" is (almost) completely irrelevant. The parameter $\theta$ can be conceived of as a method-of-moments estimator which summarizes the first-order (i.e. linear) trend of the $X, Y$ relationship. Note also that this means the bootstrap does not require the actual relationship between $X$ and $Y$ to be linear; it could be sinusoidal, quadratic, logistic, even a Heaviside step. The line $\theta x$ is simply a linear approximation to that relationship.

Now with regard to estimation of error: if the data are actually heteroscedastic, in the sense that the error depends on $X$, then the constant, model-based error is an $X$-averaged estimate of the overall error. For example, if the lower 50% of $X$ has an SD of 0.25 and the upper 50% of $X$ has an SD of 4, then the fitted residual error is a single pooled SD somewhere in between. If you repeat the study again and again with $X$ collected in a fixed design (so that each value of $X$ is the same, but $Y$ differs), then that pooled residual error is actually correct and the confidence intervals have their nominal coverage. However, if $X$ can vary, then occasionally you oversample large $X$, leading to error estimates that are too large, or oversample small $X$, leading to error estimates that are too small.

The bootstrap appropriately accounts for a random design by resampling from the empirical distribution of $X$, thus simulating random differences in the $X$ design and incorporating their added error into the CI estimation. The bootstrap accounts for heteroscedasticity.
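A minimal sketch of this case-resampling ("pairs") bootstrap for the slope, using toy heteroscedastic data as a stand-in for a real sample (the 2,000 resamples and percentile interval are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 10, size=n)
y = 2.0 * x + (0.1 + 0.3 * x) * rng.normal(size=n)   # toy heteroscedastic data

def slope(xs, ys):
    # ordinary least-squares slope (with intercept)
    X = np.column_stack([np.ones_like(xs), xs])
    return np.linalg.lstsq(X, ys, rcond=None)[0][1]

boot = np.empty(2000)
for b in range(boot.size):
    idx = rng.integers(0, n, size=n)     # resample (x_i, y_i) pairs with replacement
    boot[b] = slope(x[idx], y[idx])

print(np.percentile(boot, [2.5, 97.5]))  # percentile CI for the slope
```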

Of note, the sandwich error, also called the heteroscedasticity-consistent (HC) error estimate, has been found to be the "first-order approximation" to the bootstrap, so these two estimators are related.
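For comparison, a sandwich (HC) covariance is available directly in statsmodels; a minimal sketch, with the toy data and the choice of the HC1 variant being illustrative assumptions:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 * x + (0.1 + 0.3 * x) * rng.normal(size=200)  # toy heteroscedastic data

robust = sm.OLS(y, sm.add_constant(x)).fit(cov_type="HC1")  # sandwich covariance
print(robust.bse)         # heteroscedasticity-consistent standard errors
print(robust.conf_int())  # intervals built from the sandwich covariance
```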

AdamO
0

The book Mixed-Effects Models in S and S-PLUS describes fitting linear models (using OLS and NLS, with mixed-effects and nonlinear extensions) using the accompanying package nlme, available on CRAN. Section 5.2 of the book deals with heteroscedastic situations, including variance as a function of $x$, as you describe. It's a great resource.

goangit