
Very often in research we want to establish a linear relationship without an intercept, $y=\beta x + \epsilon$, and the sample has many double-zero observations ($x=0, y=0$). I am wondering how to deal with these many zeros. On the one hand, they are actual observations and we should not exclude them. On the other hand, they do not contribute to estimating the real value of $\beta$, because with zero-zero observations any $\beta$ value would be correct.

Including all the zero-zero observations in the linear regression ends up with an estimated $\beta$ which is significant but relies on only the few non-zero observations.

Here is an extreme example of a response $y$ and an explanatory variable $x$. Both contain 100 zeros and only 2 non-zero values.

x <- c(rep(0, 100), 5, 10)
y <- c(rep(0, 100), 10, 20)
fit <- lm(y ~ x - 1)  # no-intercept model
summary(fit)

The model output shows that the estimated value of $\beta$ is 2 and that it is significantly different from zero.

Coefficients:
  Estimate Std. Error t value Pr(>|t|)    
x        2          0     Inf   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0 on 101 degrees of freedom
Multiple R-squared:      1,     Adjusted R-squared:      1 
F-statistic:   Inf on 1 and 101 DF,  p-value: < 2.2e-16
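
The estimate comes entirely from the two non-zero points: without an intercept, the least-squares estimate is $\hat{\beta}=\sum x_i y_i/\sum x_i^2$, and the $(0,0)$ pairs contribute nothing to either sum. A quick check:

sum(x * y) / sum(x^2)  # 2, identical to coef(fit)
sum(x != 0)            # only 2 observations enter either sum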

The main purpose is to estimate what the value of $\beta$ is. In this example, I could argue that with 100 zero-zero observations $\beta$ could be of any value, but the model output fixes it to 2, based only on the 2 non-zero observations. When many zero-zero observations are present in the sample, what is the proper way to establish the linear relationship?

tiantianchen
  • What do you want to illustrate with your example? There's perfect linearity in this example ($y=2x$), so what is surprising in the output? – Stéphane Laurent Jul 07 '14 at 20:35
  • The main purpose is to estimate what the value of $\beta$ is. In this example, I could argue that with 100 zero-zero, $\beta$ could be of any value, but the model output fixes it to $2$, just based on the 2 non zero-zero values. – tiantianchen Jul 07 '14 at 20:40
  • Ok, I think I see what you mean. Your illustrative example is very particular (perfect linearity); you could also take y <- c(rep(0,100), 10, 19) and note that the confidence interval around $\beta$ is extremely narrow, which is somewhat counter-intuitive. – Stéphane Laurent Jul 07 '14 at 20:45
  • ... but this is not counter-intuitive after noting that the $100$ pairs of zeros are evidence that $\sigma$ is very small. – Stéphane Laurent Jul 07 '14 at 20:52
  • As soon as you said "establish a linear relationship without intercept" you implicitly said something like "observations of $(0,0)$ values support the assumption of no intercept, but will take no further part in estimation of the model". – Henry Jul 07 '14 at 22:17
  • @Henry, whether the intercept is zero or not can be validated by comparing two models (with and without intercept), as suggested below. However, my real worry is that the (0,0) observations do suppress the SE of $\beta$, giving $\beta$ a very low SE, which is not true by common sense. If I leave out all the (0,0) and apply the model again, the SE of $\beta$ becomes much larger. – tiantianchen Jul 08 '14 at 06:37
  • @tiantianchen: If you assume there is no intercept then the $(0,0)$ values should give you no information about estimating $\beta$ or the uncertainty in any estimate of $\beta$, since your estimate of $\beta$ is presumably $\dfrac{\sum x_i y_i}{\sum x_i^2}$. The $(0,0)$ values do not fit your $y_i=\beta x_i + \epsilon_i$ model since they all have $\epsilon_i=0$, and so they should be ignored in calculating the uncertainty in $\beta$. – Henry Jul 08 '14 at 08:07
  • Thanks @Henry. The variance of $\hat{\beta}$ is estimated as $(X^{T}X)^{-1}\hat{\sigma}^{2}$, where $\hat{\sigma}^{2}=RSS/(n-p)$. Although the large number of (0,0) observations doesn't affect the RSS, it makes $n-p$ much larger, which eventually reduces the SE of $\beta$. Is it correct? – tiantianchen Jul 08 '14 at 08:37
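
To check that arithmetic directly, here is a minimal sketch using the perturbed y from Stéphane Laurent's comment (so that the RSS is non-zero). Both fits share the same $\hat{\beta}$ and the same RSS; only $n-p$ differs:

x <- c(rep(0, 100), 5, 10)
y <- c(rep(0, 100), 10, 19)               # perturbed so that RSS > 0
fit_all <- lm(y ~ x - 1)                  # n - p = 101
fit_sub <- lm(y[x != 0] ~ x[x != 0] - 1)  # drop the (0,0) pairs: n - p = 1

coef(fit_all) - coef(fit_sub)                  # 0: same slope either way
summary(fit_all)$coefficients[, "Std. Error"]  # ~0.004
summary(fit_sub)$coefficients[, "Std. Error"]  # 0.04, sqrt(101) times larger

# by hand: SE^2 = (RSS / (n - p)) / sum(x^2), and RSS is identical for both
rss <- sum(residuals(fit_all)^2)
sqrt(rss / 101 / sum(x^2))  # matches fit_all
sqrt(rss / 1   / sum(x^2))  # matches fit_sub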

1 Answer


The problem here seems to be not the occurrence of $(0,0)$ values (since any other repeated point $(x,y)$ would behave similarly), but rather a problem of model selection. If I understand you right, you want to assume a linear model and then let the fitting procedure state doubts about either the fitted parameters or about the linear model itself.

Some points in three different directions (an R sketch covering them follows the list):

  • First, within the linear model, you could use Bayesian regression and then find that the slope has a much larger variance than the intercept (of course this also depends on the prior). Adding more (0,0) values will mainly narrow the distribution of the intercept, not the distribution of the slope.

  • Second, also assuming a linear model, you can use weighted least squares regression. If you have doubts about the non-zero values, you can assume them to have a large variance and thus a small weight. Similar to the Bayesian setup, this will widen the error bars and cast more doubt on the result for the slope.

  • Third, and more generally, if you want a procedure which doubts the model based on the training data, you can use model comparison schemes, which will show that other models besides a straight line are possible and probably equally likely (e.g., a parabola will also fit the data well).
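
A minimal sketch of these ideas in base R, reusing the perturbed data from the comments so that the residuals are not all zero. The weights, prior scale, and noise sd below are illustrative assumptions, not recommendations; a full Bayesian fit could instead be done with, e.g., the rstanarm package.

x <- c(rep(0, 100), 5, 10)
y <- c(rep(0, 100), 10, 19)

# 1. Bayesian toy version: Gaussian N(0, tau^2) priors on intercept and slope,
#    with the noise sd sigma treated as known (both values assumed here);
#    with sigma known, the posterior sds depend only on the design
posterior_sd <- function(x, sigma = 1, tau = 10) {
  X <- cbind(1, x)                               # intercept and slope columns
  prec <- t(X) %*% X / sigma^2 + diag(2) / tau^2 # posterior precision matrix
  sqrt(diag(solve(prec)))                        # posterior sds: (intercept, slope)
}
posterior_sd(c(5, 10))  # the two informative x values alone
posterior_sd(x)         # adding 100 zeros shrinks the intercept sd ~20x,
                        # the slope sd far less

# 2. weighted least squares: downweight a point you distrust; note the (0,0)
#    rows are irrelevant here, since they add w * 0 to every sum
w <- c(rep(1, 100), 0.1, 1)  # hypothetical weights
fit_w <- lm(y ~ x - 1, weights = w)
summary(fit_w)

# 3. model comparison: is the zero intercept actually supported by the data?
fit_line <- lm(y ~ x - 1)  # line through the origin
fit_int  <- lm(y ~ x)      # free intercept
anova(fit_line, fit_int)   # F-test of the nested pair
AIC(fit_line, fit_int)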

However, in the end you are stuck with the model you assume. Hence, you should choose it carefully and with regard to the data.

davidhigh
  • Thanks for your comments. Very often in practice we know that the data fit a linear relationship and we are very interested in knowing $\beta$; however, due to rare events, we often end up with many (0,0). We could argue about whether the intercept is zero or not through the model selection process you suggest. However, my main concern is that the many (0,0) suppress the SE of $\beta$, which is not true by common sense. I have tried adding very high weights to those (0,0) and very low weights to the non-(0,0), and this seems to have hardly any effect of increasing the SE of $\beta$. – tiantianchen Jul 08 '14 at 06:57
  • @tiantianchen: "However, my main concern is that the many (0,0) suppress the SE of β, which is not true by common sense." It depends on your model, but I don't think it is true in general. For example, in a Bayesian setup you will need separate variances for the two parameters, but then you will find that the variance corresponding to the intercept decreases much faster. The same seems to be the case for your program output: if I read it correctly, the significance of the intercept is much higher. – davidhigh Jul 08 '14 at 07:53
  • "However, my main concern is that the many (0,0) suppress the SE of β, which is not true by common sense."

    Why? (0,0) pairs don't contain information about β. Large standard errors make sense. Why should you be able to accurately estimate how Y linearly varies with X, on average, when there is little variation of X in your data set?

    – CloseToC Jul 08 '14 at 09:38