I was reading *An Introduction to Statistical Learning* (ISL) and found the following paragraph:
> As an extreme example, suppose we accidentally doubled our data, leading to observations and error terms identical in pairs. If we ignored this, our standard error calculations would be as if we had a sample of size 2n, when in fact we have only n samples. Our estimated parameters would be the same for the 2n samples as for the n samples, but the confidence intervals would be narrower by a factor of $\sqrt{2}$!
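The quoted effect is easy to verify numerically. Here is a minimal sketch (NumPy only, simulated data; the helper `ols_slope_se` is my own illustration, not from ISL) showing that duplicating every observation shrinks the OLS standard error by roughly $\sqrt{2}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
y = 2.0 - 0.5 * x + rng.normal(size=n)

def ols_slope_se(x, y):
    """Standard error of the slope in a simple OLS fit of y on x (with intercept)."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (len(y) - 2)      # residual variance, df = n - 2
    cov = s2 * np.linalg.inv(X.T @ X)
    return np.sqrt(cov[1, 1])

se_n = ols_slope_se(x, y)
se_2n = ols_slope_se(np.tile(x, 2), np.tile(y, 2))  # every observation doubled

# Ratio is close to sqrt(2); it is exactly sqrt(2(n-1)/(n-2)),
# slightly above sqrt(2), because the residual df also changes.
print(se_n / se_2n)
```

The ratio is not exactly $\sqrt{2}$ because the residual degrees of freedom change from $n-2$ to $2n-2$, but for moderate $n$ the difference is negligible.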
That got me thinking: if we use a linear mixed-effects model instead of linear regression, surely this shrunken standard error can be fixed.
Test case: the carsmall dataset from MATLAB (NaN values removed, leaving 93 observations).
First, I fit a linear regression model, `MPG ~ 1 + Horsepower`, to the 93 observations and get the following output:
| Variable | Estimate | SE | tStat | pValue |
|---|---|---|---|---|
| intercept | 39.362 | 1.3169 | 29.889 | 7.7492e-49 |
| Horsepower | -0.143 | 0.011134 | -12.844 | 3.7813e-22 |
Now, suppose I duplicate the data twice (i.e., three copies of the data) and fit the same linear regression model. This is what I get:
| Variable | Estimate | SE | tStat | pValue |
|---|---|---|---|---|
| intercept | 39.362 | 0.75482 | 52.148 | 2.9874e-145 |
| Horsepower | -0.143 | 0.0063814 | -22.409 | 3.6916e-64 |
As expected, the standard errors shrink, and the t-statistics and p-values change accordingly.
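The shrinkage factor can be checked against the tables above. With $k$ copies of $n$ observations and $p$ estimated coefficients, the OLS standard errors scale by $\sqrt{(n-p)/(kn-p)}$ (the cross-product matrix triples, but the residual degrees of freedom change too), which for $n=93$, $k=3$, $p=2$ is about 0.573, close to but not exactly $1/\sqrt{3}$. A quick arithmetic sketch:

```python
import math

n, k, p = 93, 3, 2  # observations, copies, parameters (intercept + slope)
factor = math.sqrt((n - p) / (k * n - p))
print(factor)       # about 0.5732

# Compare with the standard errors reported in the two tables:
print(1.3169 * factor)    # matches the triplicated intercept SE, 0.75482
print(0.011134 * factor)  # matches the triplicated Horsepower SE, 0.0063814
```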
Now, I attempted to fit a linear mixed-effects model. My thought process was that if I include a repeated-measurement random effect, that ought to give me the same standard errors as the original linear regression.
So I fit `MPG ~ 1 + Horsepower + (1 | ObservationID)`, where I defined ObservationID as 1:93 for the first 93 observations and then repeated it two more times. Some of the variance in MPG ought therefore to be explained away by the fact that each observation is measured three times. I got the following output:
| Variable | Estimate | SE | tStat | pValue |
|---|---|---|---|---|
| intercept | 39.362 | 0.75211 | 52.336 | 1.209e-145 |
| Horsepower | -0.143 | 0.0063585 | -22.49 | 1.9403e-64 |
The standard deviation of the random intercept is 2.7661, while the standard deviation of the independent error term is 0.00013357.
These standard errors are very close to those from the linear regression on the triplicated data, i.e., the fit that does not account for the observations being duplicates.
**Question:** How should I model this situation (presumably with a mixed model?) so that I get the "correct" standard errors, where "correct" means the standard errors from the first fit, in which the data were not duplicated?