Say I have N observations, possibly with multiple factors, and I repeat each observation twice (or M times). How would a regression on this new set of size NM compare to a regression on just the original observations?
2 Answers
Conceptually, you are adding no "new" information, but you "know" that information more precisely.
This would therefore result in the same regression coefficients, with smaller standard errors.
For example, in Stata, the expand x command replaces each observation with x copies of itself.
sysuse auto, clear
regress mpg weight length
------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |  -.0038515    .001586    -2.43   0.018    -.0070138   -.0006891
      length |  -.0795935   .0553577    -1.44   0.155    -.1899736    .0307867
       _cons |   47.88487    6.08787     7.87   0.000       35.746    60.02374
------------------------------------------------------------------------------
expand 5
regress mpg weight length
------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |  -.0038515   .0006976    -5.52   0.000    -.0052232   -.0024797
      length |  -.0795935   .0243486    -3.27   0.001    -.1274738   -.0317131
       _cons |   47.88487   2.677698    17.88   0.000     42.61932    53.15043
------------------------------------------------------------------------------
As you can see, the formerly insignificant coefficient (length) becomes statistically significant in the expanded model, reflecting the greater precision with which you "know" what you know.
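The same effect can be reproduced outside Stata. The following is a minimal sketch in Python (standard library only) on synthetic data, not the auto dataset; the helper names `invert` and `solve_ols`, the seed, and the simulated coefficients are all made up for illustration. It fits OLS through the normal equations, stacks the data M times, and compares the two fits.

```python
import math
import random


def invert(a):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    p = len(a)
    # Augment the matrix with the identity: [a | I].
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(p)]
           for i, row in enumerate(a)]
    for col in range(p):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        piv = aug[col][col]
        aug[col] = [v / piv for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(p):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * c for v, c in zip(aug[r], aug[col])]
    return [row[p:] for row in aug]


def solve_ols(X, y):
    """Return (coefficients, standard errors) from the normal equations."""
    n, p = len(X), len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    inv = invert(xtx)
    beta = [sum(inv[i][j] * xty[j] for j in range(p)) for i in range(p)]
    rss = sum((yi - sum(b * x for b, x in zip(beta, row))) ** 2
              for row, yi in zip(X, y))
    sigma2 = rss / (n - p)  # residual variance with n - p degrees of freedom
    se = [math.sqrt(sigma2 * inv[i][i]) for i in range(p)]
    return beta, se


# Synthetic data (an assumption for illustration): intercept plus two predictors.
random.seed(0)
N, M, P = 74, 5, 3
X = [[1.0, random.gauss(0, 1), random.gauss(0, 1)] for _ in range(N)]
y = [2.0 - 1.5 * r[1] + 0.5 * r[2] + random.gauss(0, 1) for r in X]

beta1, se1 = solve_ols(X, y)           # original N observations
betaM, seM = solve_ols(X * M, y * M)   # every observation repeated M times

# Ratio implied by the change in degrees of freedom from N - P to N*M - P.
expected_ratio = math.sqrt((N - P) / (N * M - P))

print("coefficients (original):", beta1)
print("coefficients (expanded):", betaM)
print("SE ratios expanded/original:", [sm / s1 for s1, sm in zip(se1, seM)])
print("sqrt((N - P) / (N*M - P)):  ", expected_ratio)
```

If the sketch behaves as intended, the coefficients agree to rounding error and every standard error shrinks by the same factor, which depends only on N, M and the number of estimated parameters.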
Yes, standard errors do indeed go down. Some recommend weighted linear regression for this. Is there a method you use to fix this? – BBSysDyn Mar 13 '15 at 16:53
Ordinary linear regression solves the problem $$w^* = \mbox{argmin}_w ||Xw - y||^2$$ where $X$ is the matrix of predictors and $y$ is the response. If you repeat each sample $M$ times, the objective function being minimized is unchanged except for a multiplicative factor of $M$. Therefore the weight vector that is optimal for the larger problem is the same as for the original, smaller problem.
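To spell out the scaling step, here is a short sketch. The stacked quantities $X_M$ and $y_M$ ($M$ copies of $X$ and $y$ placed on top of each other) and the row notation $x_i$ are introduced here for illustration and are not part of the original answer.

$$||X_M w - y_M||^2 = \sum_{m=1}^{M} \sum_{i=1}^{N} (x_i^\top w - y_i)^2 = M \, ||Xw - y||^2$$

Since $M > 0$ is a constant, multiplying the objective by it does not move the minimizer, so $\mbox{argmin}_w ||X_M w - y_M||^2 = \mbox{argmin}_w ||Xw - y||^2 = w^*$.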
Agreed, but I think t stats and standard errors should change given the change from N to NM? – Palace Chan Dec 12 '11 at 15:46
Since OLS assumes that the noise is independent, the standard error would be different because the number of degrees of freedom would be $MN - P$ ($N$ is the original sample size and $P$ is the number of predictors) and the residual vector has $M$ times as many entries, so the residual sum of squares goes up by a factor of $M$. – Innuo Dec 12 '11 at 16:45
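The size of this effect can be worked out from the usual OLS variance formula. This is a sketch under the standard assumptions; $\mathrm{RSS}$ (the residual sum of squares of the original fit), $\widehat{\mathrm{se}}$ and the subscript $M$ are notation introduced here, and $P$ is taken to count every estimated coefficient, including the constant.

$$\widehat{\mathrm{se}}(w_j) = \sqrt{\frac{\mathrm{RSS}}{N - P} \, \big[(X^\top X)^{-1}\big]_{jj}}$$

Repeating every observation $M$ times multiplies $\mathrm{RSS}$ and $X^\top X$ by $M$ and changes the degrees of freedom to $MN - P$, so

$$\widehat{\mathrm{se}}_M(w_j) = \sqrt{\frac{M \, \mathrm{RSS}}{MN - P} \cdot \frac{1}{M} \big[(X^\top X)^{-1}\big]_{jj}} = \widehat{\mathrm{se}}(w_j) \sqrt{\frac{N - P}{MN - P}}$$

As a check against the Stata output in the other answer, assuming the auto dataset's 74 observations with $P = 3$ and $M = 5$: $\sqrt{71/367} \approx 0.44$, which agrees with the reported ratio $.0006976 / .001586 \approx 0.44$ for weight.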