What is the impact of duplicate data on the variance of the regression coefficient?
Does increasing the size of the data always decrease the variance of the model coefficients?
Suppose I have 100 data points. I then build a second data set by duplicating the original data 100 times, so I now have 10,000 data points. If I fit the same model to both data sets, how would the model coefficients be affected, and why?
I appreciate any help you can provide.
NAS_2339
1 Answer
The coefficients themselves will not change.
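For ordinary least squares this can be checked directly. The notation here is my own assumption, not from the question: $X$ is the $n \times p$ design matrix, $y$ the response, and $X_k$, $y_k$ are the same data stacked $k$ times.

$$
X_k^\top X_k = k\,X^\top X, \qquad X_k^\top y_k = k\,X^\top y
$$

$$
\hat\beta_k = (X_k^\top X_k)^{-1} X_k^\top y_k = (k\,X^\top X)^{-1}\,k\,X^\top y = (X^\top X)^{-1} X^\top y = \hat\beta
$$

The factor $k$ cancels, so the fitted coefficients are identical for any number of copies.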
Imagine you perform the analysis on the first dataset and plot the regression line with the data points scattered around it.
Now, what happens if you duplicate the data? You just stack the new data points on top of the existing ones, so the regression line does not move. What will happen, though, is that the reported standard errors shrink and the p-values decrease greatly, because the software treats the copies as additional independent observations supporting the model obtained from the small dataset. The copies carry no new information, so this extra precision is only apparent.
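If you want to check this numerically, here is a minimal sketch in Python with NumPy. Everything in it is an assumption for illustration: the simulated data, the helper `ols`, and the choice of a simple linear model with an intercept. It fits the same model to 100 points and to the same points stacked 100 times, then compares the coefficients and the conventional standard errors. For OLS the standard errors should shrink by the factor sqrt((n - p) / (k*n - p)), which is roughly 1/sqrt(k).

```python
import numpy as np


def ols(X, y):
    """Return OLS coefficients and their conventional standard errors."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)          # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)     # assumes independent observations
    return beta, np.sqrt(np.diag(cov))


# Simulated data (hypothetical; any dataset would do).
rng = np.random.default_rng(0)
n, k = 100, 100                               # 100 points, stacked 100 times
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])          # intercept + slope

beta_small, se_small = ols(X, y)

# Stack k copies of the same rows, in the same order for X and y.
X_big = np.tile(X, (k, 1))
y_big = np.tile(y, k)
beta_big, se_big = ols(X_big, y_big)

p = X.shape[1]
expected_ratio = np.sqrt((n - p) / (k * n - p))

print("same coefficients:", np.allclose(beta_small, beta_big))
print("SE ratio (big / small):", se_big / se_small)
print("expected ratio:", expected_ratio)
```

The coefficients match, while the standard errors drop to about a tenth of their original size, which is what drives the much smaller p-values.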
Janosch