Very often in research we want to establish a linear relationship without an intercept, $y=\beta x + \epsilon$, and the sample contains many double-zero observations ($x=0$, $y=0$). I am wondering how to deal with these zeros. On one hand, they are actual observations and we should not exclude them. On the other hand, they do not contribute to estimating the true value of $\beta$, because a zero-zero observation is consistent with any value of $\beta$.
Including all the zero-zero observations in the regression produces an estimate of $\beta$ that is highly significant, yet it rests entirely on the few non-zero observations.
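To make this concrete, the least-squares estimator in the no-intercept model and its estimated standard error are
$$\hat\beta = \frac{\sum_i x_i y_i}{\sum_i x_i^2}, \qquad \widehat{\mathrm{se}}(\hat\beta) = \sqrt{\frac{\sum_i (y_i - \hat\beta x_i)^2 / (n-1)}{\sum_i x_i^2}},$$
so a zero-zero observation adds nothing to either sum, but it does increase $n$ and therefore shrinks the estimated standard error and inflates the $t$-statistic.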
Here is an extreme example with response $y$ and explanatory variable $x$. Both contain 100 zeros and only 2 non-zero values.
x <- c(rep(0,100), 5, 10)
y <- c(rep(0,100), 10, 20)
fit <- lm(y ~ x - 1)   # regression through the origin (no intercept)
summary(fit)
The model output shows that the estimated $\beta$ is 2 and that it is significantly different from zero.
Coefficients:
  Estimate Std. Error t value Pr(>|t|)
x        2          0     Inf   <2e-16 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0 on 101 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: Inf on 1 and 101 DF, p-value: < 2.2e-16
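A quick check in R (just a sketch using the closed form $\hat\beta = \sum x_i y_i / \sum x_i^2$; the variable keep is only introduced here for illustration):

x <- c(rep(0,100), 5, 10)
y <- c(rep(0,100), 10, 20)
sum(x * y) / sum(x^2)             # 2: the 100 zero-zero pairs add nothing to either sum
coef(lm(y ~ x - 1))               # same value
# Dropping the zero-zero rows leaves the point estimate unchanged;
# only the residual degrees of freedom change (1 instead of 101).
keep <- !(x == 0 & y == 0)
coef(lm(y[keep] ~ x[keep] - 1))   # still 2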
My main goal is to estimate the value of $\beta$. In this example I could argue that, given the 100 zero-zero observations, $\beta$ could be almost anything, yet the model fixes it at 2 based only on the 2 non-zero observations. When a sample contains many zero-zero observations, is this the proper way to establish the linear relationship?
Try y <- c(rep(0,100), 10, 19) and note that the confidence interval around $\beta$ is highly precise, which is somewhat counter-intuitive. – Stéphane Laurent Jul 07 '14 at 20:45
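To see the comment's point numerically (again just a sketch; comparing the fit with and without the zero-zero rows is one way to isolate their effect on the interval):

x <- c(rep(0,100), 5, 10)
y <- c(rep(0,100), 10, 19)
# Full sample: the zero-zero rows have zero residuals yet add 100 residual
# degrees of freedom, so the residual variance and the standard error of the
# slope are pushed down, giving a very tight interval.
confint(lm(y ~ x - 1))
# Non-zero pairs only: identical point estimate (1.92), much wider interval.
keep <- !(x == 0 & y == 0)
confint(lm(y[keep] ~ x[keep] - 1))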