Background: I went looking for intuitive explanations of degrees of freedom. I found some analogies that used simultaneous equations and constraints, others that cast df as the number of independent data points in a regression, and still others that explained df as the number of different directions/ways something can vary. I'm sure they're all correct, but I'm trying to relate them to each other. For example, in simultaneous equations, more constraints and fewer df is good, because you can solve for all the unknowns; in statistics, more df and fewer constraints is good, because the estimate is more reliable. I "know" this but don't understand the exact mechanics.
In simultaneous equations, if you have 10 unknowns X1 through X10 and no equations/constraints relating them, you have 10 degrees of freedom. With 10 independent equations/constraints, you have zero degrees of freedom and can solve for the unique combination of unknowns that satisfies the constraints.
With 9 independent equations/constraints, df = 1, i.e. you can write everything in terms of 1 remaining unknown, so you really have 1 free quantity, not 10. With 8 independent equations/constraints, df = 2, and you can write everything in terms of 2 unknowns, so you have 2 free quantities.
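To convince myself the counting works, here's a small numpy sketch I put together (my own check, nothing more): df is the dimension of the null space of the constraint matrix, i.e. unknowns minus independent constraints. The random coefficients are just placeholders for "some independent constraints".

```python
import numpy as np

rng = np.random.default_rng(0)

n_unknowns = 10
for n_constraints in (10, 9, 8):
    # Random Gaussian rows are linearly independent with probability 1,
    # so rank(A) should equal n_constraints here.
    A = rng.normal(size=(n_constraints, n_unknowns))
    rank = np.linalg.matrix_rank(A)
    df = n_unknowns - rank
    print(f"{n_constraints} constraints -> rank {rank}, df = {df}")

# Expected: 10 constraints -> df = 0; 9 -> df = 1; 8 -> df = 2
```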
Now I'm trying to relate this to linear regression. In Y = beta0 + beta1*X + error, I suppose estimating beta0 and beta1 imposes 2 independent constraints, so df = n-2. If you have 3 data points, n=3, df=1, and I suppose you can "write" the equation in terms of the 1 "independent" data point? And with 4 data points, n=4, df=2, and you can "write" the equation in terms of the 2 "independent" data points? This is where my analogy gets confusing to me; I might be matching the wrong parts to each other. I ramble on quite a bit below trying to think this out, so please let me know if you have any corrections to my thinking.
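If I've got the mechanics right (this is my own summary, so please correct me), the "2 constraints" are the least-squares normal equations: estimating beta0 and beta1 forces the residuals to satisfy

$$\sum_{i=1}^{n} e_i = 0 \quad\text{and}\quad \sum_{i=1}^{n} x_i e_i = 0, \qquad e_i = y_i - \hat\beta_0 - \hat\beta_1 x_i,$$

so the n residuals are subject to 2 linear constraints and only n-2 of them are free to vary.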
Taking a step back and using just Y = beta0 + error: beta0 becomes the mean of the observed Y values, and df = n-1. With n=2, once you fix beta0 = (y1 + y2)/2, knowing y1 determines y2 and vice versa, so only one value is free to vary; you can write the error term in terms of beta0 and y1, or beta0 and y2. So df = 1 around the error term.
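Writing that out explicitly (my own algebra):

$$\hat\beta_0 = \frac{y_1 + y_2}{2}, \qquad e_1 = y_1 - \hat\beta_0 = \frac{y_1 - y_2}{2} = -e_2,$$

so the two residuals are mirror images of each other and only one number is free: df = 1.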
If n=3, you can write the error term in terms of beta0 and any 2 of y1, y2, and y3, so df = 2 around the error term. I guess the more df around the error term, the more confident you can be in your estimate of the error? How does that actually work? With the "constraint" beta0 = (y1 + y2 + y3)/3, we get y1 = 3*beta0 - y2 - y3. Substituting this into the regression equation for the first observation gives 3*beta0 - y2 - y3 = beta0 + error. Why does this reduce my uncertainty about the error term compared with n=2, where the same substitution gives 2*beta0 - y2 = beta0 + error? Is it because I have two independent data points, y2 and y3, instead of just y2?
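Here's a quick numerical sanity check of my n=3 substitution (again my own sketch; the y values are made up):

```python
import numpy as np

y = np.array([2.0, 5.0, 11.0])   # made-up observations
beta0 = y.mean()                 # the "constraint": beta0 = (y1 + y2 + y3)/3
e = y - beta0                    # residuals

print(e.sum())                   # ~0: the constraint removes one residual's freedom
print(e[0], -(e[1] + e[2]))      # e1 is fully determined by e2 and e3

# The usual variance estimate divides by df = n - 1, not n:
print(e @ e / (len(y) - 1), y.var(ddof=1))  # these match
```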
Switching back to regression with one independent variable, Y = beta0 + beta1*X + error: if n=3 then df=1, so I can now describe the error term in terms of a single data point, either (x1,y1) or (x2,y2) or (x3,y3). I think that's because you have to relate (x1,y1), (x2,y2), and (x3,y3) once to calculate beta0 and again to calculate beta1. So when you substitute those 2 constraints into the regression equation via X and Y, the error term can be written in terms of just one of the data points.
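To check the "one free residual" story with actual numbers, here's another of my sketches (made-up data points; numpy's lstsq does the least-squares fit):

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0])
y = np.array([2.0, 3.0, 9.0])           # made-up points, n = 3

X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta                        # residuals

# The two constraints from estimating beta0 and beta1:
print(e.sum())          # ~0
print((x * e).sum())    # ~0

# Two linear constraints on three residuals -> df = 1:
# given e1 alone, the constraints pin down e2 and e3.
A = np.array([[1.0, 1.0], [x[1], x[2]]])
b = np.array([-e[0], -x[0] * e[0]])
print(np.linalg.solve(A, b), e[1:])     # these match
```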
Playing this out: every additional coefficient you add to your regression, e.g. polynomial terms as in Y = beta0 + beta1*X + beta2*X^2 + error, adds a constraint and reduces by one the number of independent data points with which you can "describe" the error term.
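A quick way I found to see this (my own sketch, made-up data): fit polynomials of increasing degree to the same n = 4 points and watch the residual df shrink by one per coefficient.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 5.0])
y = np.array([1.0, 4.0, 8.0, 20.0])     # made-up data, n = 4

for degree in (0, 1, 2, 3):             # 1, 2, 3, 4 coefficients
    coef = np.polyfit(x, y, degree)     # least-squares polynomial fit
    e = y - np.polyval(coef, x)
    print(f"{degree + 1} coefficients -> residual df = {len(x) - degree - 1}, "
          f"sum of squared residuals = {e @ e:.6f}")

# With 4 coefficients and 4 points the fit is exact: df = 0, residuals ~ 0.
```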
Moving to 3D space by adding an additional regressor variable:
You now have 2 independent variables, so Y = beta0 + beta1*X1 + beta2*X2 + error. If n=3, df=0, and the fitted surface is a plane that passes exactly through all 3 points. There is no error term because the 3 constraints from calculating beta0, beta1, and beta2 relate the 3 data points such that, when you substitute them into the regression equation via X1, X2, and Y, the error term disappears.
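Checking that last claim numerically (my own sketch; the points are made up, chosen so the 3x3 system is invertible):

```python
import numpy as np

x1 = np.array([0.0, 1.0, 0.0])
x2 = np.array([0.0, 0.0, 1.0])
y  = np.array([1.0, 3.0, 5.0])              # made-up points

X = np.column_stack([np.ones(3), x1, x2])   # 3x3 design matrix, full rank here
beta = np.linalg.solve(X, y)                # exact solve, no least squares needed
print(y - X @ beta)                         # ~[0, 0, 0]: df = 0, no error term
```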