
Suppose we have a big data set where, even if I break it into 10 smaller pieces, the number of data points in each piece still far outnumbers the number of variables. Now suppose I run a regression using two methods:

  1. run OLS on the entire dataset
  2. break the data set into 10 random subsets, run a separate OLS on each, and then take the average of the coefficients (a rough sketch of both procedures follows this list).
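For concreteness, here is a minimal sketch of what I mean (using numpy; the data, model, and sizes are just placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder data: n points, p variables, far more points than variables.
n, p = 100_000, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

def ols(X, y):
    """OLS coefficients via least squares (no intercept, just for illustration)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Method 1: OLS on the entire dataset.
beta_full = ols(X, y)

# Method 2: split into 10 random subsets, fit each, average the coefficients.
idx = rng.permutation(n)
beta_parts = [ols(X[part], y[part]) for part in np.array_split(idx, 10)]
beta_avg = np.mean(beta_parts, axis=0)
```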

I would like to know

  • Would the coefficients from the two approaches be the same?
  • Would the standard errors of the coefficients be the same?
  • Overall, is there any advantage of one over the other (outside of computational considerations)?

Thank you all in advance!

wwyws

1 Answer


Imagine you have the following dataset, where each row is a measured $(x,y)$ pair: $$ \begin{pmatrix} 0 & 0 \\ 0 & 0 \\ 0 & 1 \\ 1 & 0 \\ 1 & 1 \\ 1 & 1 \\ \end{pmatrix} $$ I.e., at $x=0$ you have three $y$-values, two of them zero and one equal to one, and at $x=1$ you have one $y$-value equal to zero and two equal to one.

Next, presume that your random partitioning creates the following two partitions: $$ P_1 = \begin{pmatrix} 0 & 0 \\ 0 & 0 \\ 1 & 0 \end{pmatrix}, \quad P_2 = \begin{pmatrix} 0 & 1 \\ 1 & 1 \\ 1 & 1 \end{pmatrix}. $$ Let's compare the regression results.

The important point is that, in this example, OLS fits at each $x$ the average of the $y$-values observed there. So, while OLS on the complete dataset gives you the fitted values $$ \begin{align} \hat y(x=0) &= 1/3\\ \hat y(x=1) &= 2/3 \end{align} $$ the average of the fits over the partitioned data gives: $$ \begin{align} \hat y^p(x=0) &= \frac{\operatorname{avg}\{y \,|\, (x,y)\in P_1, x=0\} + \operatorname{avg}\{y\,|\,(x,y)\in P_2, x=0\}}{2}\\ &= \frac{0+1}{2}\\ &= \frac{1}{2}\\ \hat y^p(x=1) &= \frac{\operatorname{avg}\{y \,|\, (x,y)\in P_1, x=1\} + \operatorname{avg}\{y\,|\,(x,y)\in P_2, x=1\}}{2}\\ &= \frac{0+1}{2}\\ &= \frac{1}{2} \end{align} $$
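If it helps, here is a minimal numerical sketch (using numpy; `fit_line` is just an illustrative helper name) that reproduces the fitted values above:

```python
import numpy as np

def fit_line(x, y):
    """OLS fit of y = a + b*x; returns (a, b)."""
    A = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(A, y, rcond=None)[0]

# Full dataset from above: three y-values at x=0 and three at x=1.
x = np.array([0., 0., 0., 1., 1., 1.])
y = np.array([0., 0., 1., 0., 1., 1.])

a, b = fit_line(x, y)
print(a, a + b)            # fitted values at x=0 and x=1: 1/3 and 2/3

# The two partitions P_1 and P_2.
a1, b1 = fit_line(np.array([0., 0., 1.]), np.array([0., 0., 0.]))
a2, b2 = fit_line(np.array([0., 1., 1.]), np.array([1., 1., 1.]))

# Averaging the per-partition fits gives 1/2 at both x=0 and x=1.
print((a1 + a2) / 2, (a1 + b1 + a2 + b2) / 2)
```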

So we see that averaging the OLS results from the partitions gives the wrong answer, and the method without partitioning is to be preferred. Note that the error introduced by the partitioning method can be made arbitrarily large, which makes a comparison of the standard errors of the coefficients moot.

The deeper reason for this phenomenon is that averaging sub-averages destroys the proper weighting of your data. And you cannot fix this with a single weighted average per partition, because the appropriate weights are usually different for different $x$.
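For completeness: the combination that would reproduce the full-data fit uses matrix weights rather than scalars. Assuming each partition's fit is well-defined, the normal equations give $X_k^\top X_k \hat\beta_k = X_k^\top y_k$ for each partition $k$, so $$ \hat\beta_{\text{full}} = \Big(\sum_k X_k^\top X_k\Big)^{-1} \sum_k \big(X_k^\top X_k\big)\,\hat\beta_k. $$ These matrix "weights" depend on how the $x$-values are distributed within each partition, so no single scalar weight per partition will do in general.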

frank
  • Thank you. Would this also apply when the data are continuous (where it's unlikely two observations share the same X or Y) and the sample sizes are much bigger? – wwyws Feb 16 '22 at 16:05
  • Sure. The fitted $\hat y$ at a point $x$ is not only influenced by the measured $y$ at $x$ but also (more or less) by those nearby. The "more or less" part is determined by your model and optimization procedure. This "nearby"-concept is the whole idea of regression. – frank Feb 16 '22 at 16:39