
I was reading ISL and found the following paragraph:

As an extreme example, suppose we accidentally doubled our data, leading to observations and error terms identical in pairs. If we ignored this, our standard error calculations would be as if we had a sample of size 2n, when in fact we have only n samples. Our estimated parameters would be the same for the 2n samples as for the n samples, but the confidence intervals would be narrower by a factor of $\sqrt2$!
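To spell out where the $\sqrt2$ comes from (my own quick sketch of the algebra, not quoted from ISL): duplicating every observation doubles both $X^\top X$ and the residual sum of squares, so the OLS estimates are unchanged while the estimated coefficient variances are roughly halved:

$$\widehat{\operatorname{Var}}\!\left(\hat\beta\right)_{2n} = \frac{2\,\mathrm{RSS}}{2n-p}\left(2X^\top X\right)^{-1} = \frac{\mathrm{RSS}}{2n-p}\left(X^\top X\right)^{-1} \approx \tfrac12\cdot\frac{\mathrm{RSS}}{n-p}\left(X^\top X\right)^{-1} = \tfrac12\,\widehat{\operatorname{Var}}\!\left(\hat\beta\right)_{n},$$

hence standard errors (and confidence intervals) narrower by about a factor of $\sqrt2$.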

That got me thinking: if we use a linear mixed-effects model instead of linear regression, surely this shrunken standard error can be corrected.

Test case: the carsmall dataset from MATLAB (NaN values removed, leaving 93 observations)

First, I fit a linear regression model on the 93 observations, MPG ~ 1 + Horsepower, and get the following output:

| Variable | Estimate | SE | tStat | pValue |
| --- | --- | --- | --- | --- |
| intercept | 39.362 | 1.3169 | 29.889 | 7.7492e-49 |
| Horsepower | -0.143 | 0.011134 | -12.844 | 3.7813e-22 |
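Roughly, the setup looks like this (a minimal sketch assuming the Statistics and Machine Learning Toolbox; the exact NaN handling may differ from what I describe above):

```matlab
% Load the built-in carsmall data and keep only complete rows
load carsmall                                % provides MPG, Horsepower, ...
tbl = rmmissing(table(MPG, Horsepower));     % drop rows containing NaN (93 rows here)

% Ordinary least squares: MPG ~ 1 + Horsepower
lm = fitlm(tbl, 'MPG ~ 1 + Horsepower');
disp(lm.Coefficients)                        % Estimate, SE, tStat, pValue
```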

Now, let's say I duplicate the data twice (i.e., three copies of the data) and fit the same linear regression model. This is what I get:

| Variable | Estimate | SE | tStat | pValue |
| --- | --- | --- | --- | --- |
| intercept | 39.362 | 0.75482 | 52.148 | 2.9874e-145 |
| Horsepower | -0.143 | 0.0063814 | -22.409 | 3.6916e-64 |
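The duplication step is just stacking the same table three times (sketch, continuing from the `tbl` above):

```matlab
% Stack three identical copies of the 93 rows and refit the same OLS model
tbl3 = repmat(tbl, 3, 1);
lm3 = fitlm(tbl3, 'MPG ~ 1 + Horsepower');
disp(lm3.Coefficients)                       % same estimates, SEs shrink by ~1/sqrt(3)
```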

As expected, the standard errors shrink, roughly by a factor of $1/\sqrt3$ (more precisely $\sqrt{(93-2)/(3\cdot 93-2)} \approx 0.573$, e.g. $1.3169 \times 0.573 \approx 0.755$ for the intercept), so the t-statistics and p-values are inflated accordingly.

Now, I attempted to fit a linear mixed-effects model. My thought process was that if I include a repeated-measures random effect, that ought to give me the same standard errors as the original linear regression case.

So, I did the following: MPG ~ 1 + Horsepower + (1 | ObservationID), where I defined ObservationID as 1:93 for the first 93 observations and repeated it for the two duplicate copies. Some of the variance in MPG therefore ought to be explained away by the fact that each observation appears three times. I got the following output:

| Variable | Estimate | SE | tStat | pValue |
| --- | --- | --- | --- | --- |
| intercept | 39.362 | 0.75211 | 52.336 | 1.209e-145 |
| Horsepower | -0.143 | 0.0063585 | -22.49 | 1.9403e-64 |
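A sketch of the mixed-model fit (again assuming the Statistics and Machine Learning Toolbox; `ObservationID` is built from the row index of the original 93-row table):

```matlab
% Label each original observation, triplicate, and fit a random-intercept model
tbl.ObservationID = categorical((1:height(tbl))');   % IDs 1..93
tbl3 = repmat(tbl, 3, 1);                             % three identical copies; IDs repeat
lme = fitlme(tbl3, 'MPG ~ 1 + Horsepower + (1|ObservationID)');
disp(lme)                                             % fixed effects + random-effect SDs
```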

The estimated standard deviation of the random intercept is 2.7661, while the standard deviation of the independent (residual) error term is 0.00013357.

This is very similar to the linear regression case where I was not accounting for the fact that these are the same data points.

Question: How should I model this situation (probably using mixed models?) so that I get the "correct" standard errors, where "correct" means the standard errors from the first fit, in which the data were not duplicated?

stuckstat
  • Is there a reason you cannot remove the duplicated observations from your data? No statistical procedure other than removal of the offending rows will solve this for you. If removing is a non starter, you could mark them with a dummy variable for original/duplicate (0/1) and then restrict analysis to just the originals. – Erik Ruzek Jan 22 '24 at 13:49
  • I purposely duplicated the data to test this out... can we not "account" for this statistically? In a way, this is repeated measures, right? – stuckstat Jan 22 '24 at 15:47
  • Sorry, I didn't notice that you purposefully duplicated the observations. In a very strange way, yes, it is repeated measures. However the repeated measurements are exact duplicates, which means they have little bearing on residual variance (the measurements are all the same) and amplify group variance (you've reinforced any pre-existing group differences). I work in social sciences and it is very rare to see this kind of variance pattern. Usually there is a lot more "noise" in the data (residual variance). – Erik Ruzek Jan 22 '24 at 19:01
  • Yes, I agree. It is a strange test. Do you think if I introduced a small amount of noise in the repeated data, it would be possible to "fix" the standard errors so that they are no longer leading to inflated statistics? – stuckstat Jan 23 '24 at 10:04
  • You could do that but I'm not sure it is ideal. Instead, I would probably suggest simulating repeated measures data with given levels of residual and cluster variance. You could simulate and run models under differing numbers of repeated measures to see how the standard errors are dealt with in OLS vs. mixed effects models. For example, use the simulation parameters in this post https://stats.stackexchange.com/a/481865/87305 – Erik Ruzek Jan 24 '24 at 01:08
  • Related: https://stats.stackexchange.com/questions/216003/what-are-the-consequences-of-copying-a-data-set-for-ols/216011#216011 – kjetil b halvorsen Feb 14 '24 at 02:18

0 Answers