Background: I went looking for intuitive explanations of degrees of freedom. I found some analogies that used simultaneous equations and constraints, others that cast df as the number of independent data points in a regression, and still others that explained df as the number of different directions/ways something can vary. I'm sure they're all correct, but I'm trying to relate them to each other. For example, in simultaneous equations, more constraints and fewer df is good, because you can solve for all the unknowns; in statistics, more df and fewer constraints is good, because the estimate is more reliable. I "know" this but don't understand the exact mechanics.
In simultaneous equations, if you have 10 unknowns X1 through X10 and no equations/constraints relating them, you have 10 degrees of freedom. With 10 independent equations/constraints, you have zero degrees of freedom and can solve for the unique combination of unknowns that satisfies the constraints.
With 9 independent equations/constraints, df = 1, i.e. you can write everything in terms of 1 remaining unknown, so you really have 1 free quantity, not 10. With 8 independent equations/constraints, df = 2, and you can write everything in terms of 2 unknowns, so you have 2 free quantities.
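To convince myself the counting works, here's a small numpy sketch I put together (my own check, nothing more): df is the dimension of the null space of the constraint matrix, i.e. unknowns minus independent constraints. The random coefficients are just placeholders for "some independent constraints".

```python
import numpy as np

rng = np.random.default_rng(0)

n_unknowns = 10
for n_constraints in (10, 9, 8):
    # Random Gaussian rows are linearly independent with probability 1,
    # so rank(A) should equal n_constraints here.
    A = rng.normal(size=(n_constraints, n_unknowns))
    rank = np.linalg.matrix_rank(A)
    df = n_unknowns - rank
    print(f"{n_constraints} constraints -> rank {rank}, df = {df}")

# Expected: 10 constraints -> df = 0; 9 -> df = 1; 8 -> df = 2
```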
Now I'm trying to relate this to linear regression. In Y = beta0 + beta1*X + error, I suppose estimating beta0 and beta1 imposes 2 independent constraints, so df = n-2. If you have 3 data points, n=3, df=1, and I suppose you can "write" the equation in terms of the 1 "independent" data point? And with 4 data points, n=4, df=2, and you can "write" the equation in terms of the 2 "independent" data points? This is where my analogy gets confusing to me; I might be matching the wrong parts to each other. I ramble on quite a bit below trying to think this out, so please let me know if you have any corrections to my thinking.
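If I've got the mechanics right (this is my own summary, so please correct me), the "2 constraints" are the least-squares normal equations: estimating beta0 and beta1 forces the residuals to satisfy

$$\sum_{i=1}^{n} e_i = 0 \quad\text{and}\quad \sum_{i=1}^{n} x_i e_i = 0, \qquad e_i = y_i - \hat\beta_0 - \hat\beta_1 x_i,$$

so the n residuals are subject to 2 linear constraints and only n-2 of them are free to vary.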
Taking a step back and using just Y = beta0 + error: beta0 becomes the mean of the observed Y values, and df = n-1. With n=2, once you fix beta0 = (y1 + y2)/2, knowing y1 determines y2 and vice versa, so only one value is free to vary; you can write the error term in terms of beta0 and y1, or beta0 and y2. So df = 1 around the error term.
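Writing that out explicitly (my own algebra):

$$\hat\beta_0 = \frac{y_1 + y_2}{2}, \qquad e_1 = y_1 - \hat\beta_0 = \frac{y_1 - y_2}{2} = -e_2,$$

so the two residuals are mirror images of each other and only one number is free: df = 1.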
If n=3, you can write the error term in terms of beta0 and any 2 of y1, y2, and y3, so df = 2 around the error term. I guess the more df around the error term, the more confident you can be in your estimate of the error? How does that actually work? With the "constraint" beta0 = (y1 + y2 + y3)/3, we get y1 = 3*beta0 - y2 - y3. Substituting this into the regression equation for the first observation gives 3*beta0 - y2 - y3 = beta0 + error. Why does this reduce my uncertainty about the error term compared with n=2, where the same substitution gives 2*beta0 - y2 = beta0 + error? Is it because I have two independent data points, y2 and y3, instead of just y2?
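Here's a quick numerical sanity check of my n=3 substitution (again my own sketch; the y values are made up):

```python
import numpy as np

y = np.array([2.0, 5.0, 11.0])   # made-up observations
beta0 = y.mean()                 # the "constraint": beta0 = (y1 + y2 + y3)/3
e = y - beta0                    # residuals

print(e.sum())                   # ~0: the constraint removes one residual's freedom
print(e[0], -(e[1] + e[2]))      # e1 is fully determined by e2 and e3

# The usual variance estimate divides by df = n - 1, not n:
print(e @ e / (len(y) - 1), y.var(ddof=1))  # these match
```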
Switching back to regression with one independent variable, Y = beta0 + beta1*X + error: if n=3 then df=1, so I can now describe the error term in terms of a single data point, either (x1,y1) or (x2,y2) or (x3,y3). I think that's because you have to relate (x1,y1), (x2,y2), and (x3,y3) once to calculate beta0 and again to calculate beta1. So when you substitute those 2 constraints into the regression equation via X and Y, the error term can be written in terms of just one of the data points.
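To check the "one free residual" story with actual numbers, here's another of my sketches (made-up data points; numpy's lstsq does the least-squares fit):

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0])
y = np.array([2.0, 3.0, 9.0])           # made-up points, n = 3

X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta                        # residuals

# The two constraints from estimating beta0 and beta1:
print(e.sum())          # ~0
print((x * e).sum())    # ~0

# Two linear constraints on three residuals -> df = 1:
# given e1 alone, the constraints pin down e2 and e3.
A = np.array([[1.0, 1.0], [x[1], x[2]]])
b = np.array([-e[0], -x[0] * e[0]])
print(np.linalg.solve(A, b), e[1:])     # these match
```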
Playing this out: every additional coefficient you add to your regression, e.g. polynomial terms as in Y = beta0 + beta1*X + beta2*X^2 + error, adds a constraint and reduces by one the number of independent data points with which you can "describe" the error term.
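A quick way I found to see this (my own sketch, made-up data): fit polynomials of increasing degree to the same n = 4 points and watch the residual df shrink by one per coefficient.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 5.0])
y = np.array([1.0, 4.0, 8.0, 20.0])     # made-up data, n = 4

for degree in (0, 1, 2, 3):             # 1, 2, 3, 4 coefficients
    coef = np.polyfit(x, y, degree)     # least-squares polynomial fit
    e = y - np.polyval(coef, x)
    print(f"{degree + 1} coefficients -> residual df = {len(x) - degree - 1}, "
          f"sum of squared residuals = {e @ e:.6f}")

# With 4 coefficients and 4 points the fit is exact: df = 0, residuals ~ 0.
```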
Moving to 3D space by adding an additional regressor variable:
You now have 2 independent variables, so Y = beta0 + beta1*X1 + beta2*X2 + error. If n=3, df=0, and the fitted surface is a plane that passes exactly through all 3 points. There is no error term because the 3 constraints from calculating beta0, beta1, and beta2 relate the 3 data points such that, when you substitute them into the regression equation via X1, X2, and Y, the error term disappears.
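Checking that last claim numerically (my own sketch; the points are made up, chosen so the 3x3 system is invertible):

```python
import numpy as np

x1 = np.array([0.0, 1.0, 0.0])
x2 = np.array([0.0, 0.0, 1.0])
y  = np.array([1.0, 3.0, 5.0])              # made-up points

X = np.column_stack([np.ones(3), x1, x2])   # 3x3 design matrix, full rank here
beta = np.linalg.solve(X, y)                # exact solve, no least squares needed
print(y - X @ beta)                         # ~[0, 0, 0]: df = 0, no error term
```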