
I am trying to use linear least squares regression to extract the coefficients of a model. Specifically, I am looking at a model with two independent predictor variables $x_1$ and $x_2$, and an output response variable $y$, with coefficients $\beta_0$, $\beta_1$ and $\beta_2$. (I believe this is a case of multiple linear regression.) The model is of the form $$ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i \tag{1} $$ where $i$ denotes the observation number. Or, in matrix form with $N$ observations, $$ \begin{align} \mathbf{Y} &= \mathbf{A}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \\ \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix} &= \begin{pmatrix} 1 & x_{11} & x_{12} \\ 1 & x_{21} & x_{22}\\ \vdots & \vdots & \vdots \\ 1 & x_{N1} & x_{N2} \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2\\ \vdots \\ \varepsilon_N \end{pmatrix} \end{align} \tag{2} $$ I should be able to solve this using the least squares solution $$\boldsymbol{\beta} = (\mathbf{A}^\textrm{T}\mathbf{A})^{-1}\mathbf{A}^\textrm{T}\mathbf{Y}. \tag{3}$$
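(For reference, Eq. (3) is just the solution of the normal equations obtained by minimizing the residual sum of squares, $$ \min_{\boldsymbol{\beta}} \left\| \mathbf{Y} - \mathbf{A}\boldsymbol{\beta} \right\|^2 \quad\Longrightarrow\quad \mathbf{A}^\textrm{T}\mathbf{A}\,\boldsymbol{\beta} = \mathbf{A}^\textrm{T}\mathbf{Y}, $$ which has the unique solution of Eq. (3) whenever $\mathbf{A}^\textrm{T}\mathbf{A}$ is invertible.)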

I made a simple (Matlab) script to see if I could recover the correct beta coefficients on a test case, given below. I used three methods: 1) directly solving the normal equations behind Eq. (3) with Matlab's backslash operator; 2) applying the backslash operator to $\mathbf{A}\boldsymbol{\beta} = \mathbf{Y}$ directly, without forming $\mathbf{A}^\textrm{T}\mathbf{A}$, since this also coincides with the least squares result; 3) solving Eq. (3) using a QR decomposition instead.

x1_data = linspace(-1,1,20).';      % First independent variable
x2_data = linspace(-0.3,0.6,20).';  % Second independent variable

beta0 = -0.2; % Intercept
beta1 = 0.4;  % coefficient of x1 data
beta2 = 1.2;  % coefficient of x2 data

y_data = beta0 + beta1*x1_data + beta2*x2_data; % Create test response data

A = [ ones(length(x1_data),1) x1_data x2_data ]; % Design matrix
Y = y_data;

beta_fit1 = (A.'*A) \ (A.'*Y); % Method #1
beta_fit2 = A \ Y;             % Method #2

[Q,R] = qr(A);
beta_fit3 = R \ (Q'*Y);        % Method #3

which outputs:

beta_fit1 = [-254.4 -762.3 1696.0]
beta_fit2 = [-0.02 0.94 0]
beta_fit3 = [-0.02 0.94 0]

None of these match the original input beta coefficients. Can anyone tell me what might be going wrong here?

Thanks


Edit

An example to show that it works when fitting a 2D parabolic function with no noise (i.e. with deterministic data, as mentioned by @Gregg H):

x1_data = linspace(-1,1,20).';      % First predictor variable
x2_data = linspace(-0.3,0.6,20).';  % Second predictor variable

beta0_true = -0.2; % Intercept
beta1_true = 0.4;  % coefficient of x1 data
beta2_true = 1.2;  % coefficient of x2 data

y_data = beta0_true + beta1_true*x1_data.^2 + beta2_true*x2_data.^2; % Create test response data

A = [ ones(length(x1_data),1) x1_data.^2 x2_data.^2 ]; % Design matrix
Y = y_data;

beta_fit2 = A \ Y;

which outputs as expected:

beta_fit2 = [-0.2   0.4   1.2]

The issue arises when trying to add a linear term (such as beta4_true * x1_data), i.e. a tilt to the parabola.
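For concreteness, a minimal sketch of this failing case (same variables as in the edit above; the value of beta4_true is arbitrary and chosen here only for illustration):

beta4_true = 0.7;  % Arbitrary tilt coefficient, for illustration only
y_tilt = beta0_true + beta4_true*x1_data ...
         + beta1_true*x1_data.^2 + beta2_true*x2_data.^2;           % Tilted parabola
A_tilt = [ ones(length(x1_data),1) x1_data x1_data.^2 x2_data.^2 ]; % Design matrix with the linear term
beta_fit_tilt = A_tilt \ y_tilt   % Does not reproduce [beta0_true beta4_true beta1_true beta2_true]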

  • Your code is labeled with three methods to estimate coefficients, but you don't tell us what the outputs are. You are more likely to get useful answers to this question if you can articulate your process & results in a way that does not depend on users reading, understanding, and running your code. // If I understand the code correctly (I don't use MATLAB), you have a design matrix with linearly dependent columns and y_data has no error term. So ... yeah, there are multiple solutions to the linear system. – Sycorax May 22 '23 at 15:56
  • A minimal reproducible example of this phenomenon, which will therefore be helpful to contemplate, is $x_1=(1),$ $y=(0),$ with the model $E[y] = \alpha + \beta x_1.$ Consultant (A) tells you the solution is $\hat\alpha=\hat\beta=0$ while consultant (B) tells you no, it's $\hat\alpha=1,$ $\hat\beta=-1.$ Who is correct, if either? – whuber May 22 '23 at 18:03
  • @Sycorax Sure. I have edited the question to try to reflect the issues you raised. Essentially, all 3 methods are solving Eq. (3), just with slightly different numerical approaches. I did not include an error term because I wanted to generate some test data, so it should be easy to obtain the fit coefficients exactly (so I thought). Also, Eq. (3) does not contain an error term (even though it was present in Eq. (1), from which it was derived). – teeeeee May 22 '23 at 22:31
  • @whuber I like your example. I guess you are tactfully trying to tell me that there are multiple solutions to my setup. I thought that since I generated the y data exactly using the model equation, I should be able to recover the coefficients exactly? Is that not the case? Am I missing something fundamental? Thanks – teeeeee May 22 '23 at 22:32
  • Try printing out A and Y, confirming that everything is as it ought to be entering the fitting steps. – jbowman May 22 '23 at 22:35
  • There are infinitely many solutions to this system of linear equations. https://en.wikipedia.org/wiki/System_of_linear_equations#Solution_set The reason is that $A$ is not full rank. https://en.wikipedia.org/wiki/Rank_(linear_algebra) The $QR$ decomposition is rank-revealing; the rank of $A$ is the number of pivots in $R$. Because there are not 3 pivots in $R$, you know there is not a unique solution, so there is no such thing as "true" coefficients for this problem. – Sycorax May 22 '23 at 22:36
  • @Sycorax okay, I see that there is not a unique solution, and how to recognise it. What I don't understand is what it is about my test model that makes it this way. For example, if I change x2_data to x1_data.^2 then it works, i.e. if I perform a polynomial regression instead of a multiple-variable regression. – teeeeee May 22 '23 at 22:53
  • The columns of $A$ are linearly dependent, as I stated in my first comment, so they are not a basis. The polynomial basis is a basis because they are linearly independent (cf Vandermonde matrix). Or pick any of the other 22 properties from the invertible matrix theorem. – Sycorax May 22 '23 at 23:02
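To make the rank-revealing point from the comments concrete, here is a minimal MATLAB check (a sketch, using the design matrix A from the question's first script):

[Q,R] = qr(A);
disp(diag(R).')   % The last diagonal entry of R is zero to within round-off
disp(rank(A))     % Reports 2 rather than 3: the columns are linearly dependent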

1 Answer


At least the 2nd and 3rd solutions are correct.

Your design matrix has linearly dependent columns. For example, the third column can be expressed in terms of the first two: $x_2 = 0.15 + 0.45 x_1$, so the fitted equation can also be written as

$$\begin{align} -0.2 + 0.4 x_1 + 1.2 x_2 &= -0.2 + 0.4 x_1 + 1.2 (0.15 + 0.45 x_1) \\ &= (-0.2 + 0.18) + (0.4 + 0.54) x_1 \\ &= -0.02 + 0.94 x_1 \end{align}$$
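This relation is easy to check numerically (a quick sketch in MATLAB, using the variables from the question's first script):

max(abs( x2_data - (0.15 + 0.45*x1_data) ))   % ~0: x2 is an exact linear function of x1
max(abs( y_data  - (-0.02 + 0.94*x1_data) ))  % ~0: the reduced model fits the data exactly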

Methods 2 and 3

This last expression on the right-hand side is the solution given by the 2nd and 3rd methods, which effectively drop one of the columns. In R you get the same behavior when you use the function lm, which gives as output:

> lm(y~X+0)

Call:
lm(formula = y ~ X + 0)

Coefficients:
   X1     X2     X3
-0.02   0.94     NA

The last column is effectively ignored when you give the computer the task of solving the equation.
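MATLAB's backslash operator does something analogous (a sketch, using A and Y from the question's first script): it detects the rank deficiency and returns a basic solution with a zero for the redundant column, which is exactly the beta_fit2 output shown in the question.

beta_fit2 = A \ Y   % Typically warns "Rank deficient, rank = 2"
                    % and returns roughly [-0.02; 0.94; 0]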

Method 1

Your 1st method probably attempts to invert the (non-invertible) matrix $\mathbf{A}^\textrm{T}\mathbf{A}$ anyway. For example, the inv command does give some output. On my computer (an online one, https://www.tutorialspoint.com/execute_matlab_online.php) I get:

disp(inv(A.'*A));
-3.3777e+13  -1.0133e+14   2.2518e+14
-1.0133e+14  -3.0399e+14   6.7554e+14
 2.2518e+14   6.7554e+14  -1.5012e+15

and the resulting estimates are close to a valid solution but not exact (possibly due to round-off errors).

In my case I got -0.071429 0.785714 0.342857, which is close to a correct solution (any exact solution must satisfy $\beta_0 + 0.15\beta_2 = -0.02$ and $\beta_1 + 0.45\beta_2 = 0.94$): here $-0.071429+0.15 \cdot 0.342857 \approx -0.02$ and $0.785714+0.45 \cdot 0.342857 \approx 0.94$.

In your case the difference is larger: $-254.4 + 1696 \cdot 0.15 = 0$ (rather than $-0.02$) and $-762.3 + 1696 \cdot 0.45 = 0.9$ (rather than $0.94$), but this might be due to the output being given with less precision.
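The near-singularity behind this can also be checked directly in MATLAB (a sketch, using the question's A; the exact values are machine-dependent):

disp(rank(A.'*A))    % 2: the 3x3 normal-equations matrix is numerically singular
disp(rcond(A.'*A))   % Essentially zero, i.e. singular to working precision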

In R I can get the same result when I use the solve command while setting the tolerance parameter extremely low. In that case the inverse matrix is still computed; it can be computed because, due to round-off errors, the columns of the matrix X are not exactly dependent.

X = cbind(rep(1,20), seq(-1,1,length.out = 20), seq(-0.3,0.6,length.out = 20))
beta = c(-0.2,0.4,1.2)
y = X %*% beta
X = round(X,5)
solve(t(X) %*% X, tol = 10^-50) %*% t(X) %*% y
#           [,1]
#[1,] -0.0430674
#[2,]  0.8707996
#[3,]  0.1537806
  • Thanks for your answer, and for confirming the result on your machine. So, what I am being steered towards is that the problem is with the fact that my two predictor variables are not independent, and are highly correlated due to the way I have constructed them. (As shown in my edit, I am not convinced that it is because of lack of noise, as has been suggested). So, is it not possible to fit a 2D function in this way if the two predictor vectors are linearly spaced? – teeeeee May 23 '23 at 09:08
  • @teeeeee the problem with the fit is because of the dependence between the columns and not because of the noise. Sidenote: there can be fitting methods that don't fit well when there is zero noise. The nls function in R, which uses a gradient descent, will function badly when there is no noise. An example command is x = 1:10; nls(1+x+x^2+rnorm(10) ~ a+b*x+c*x^2, start = list(a=0, b=0, c=0) ) which will not work when you remove the noise +rnorm(10). – Sextus Empiricus May 23 '23 at 09:18
  • In the edit, I have shown that fitting a 2D parabola works correctly (even though the two variables are still correlated, and are not independent). However, adding a tilt to the parabola with another linear term fails. (But adding a cubic term works, for example). What is going on here? – teeeeee May 23 '23 at 09:19
  • If you insist on using these two predictor variables (for whatever reason), then possibly you could add a second condition for a relationship between the two coefficients, or you can leave out the intercept. – Sextus Empiricus May 23 '23 at 09:21
  • @teeeeee the fitting doesn't work because your columns are linearly dependent. The tilt can already be generated by the linear term that you already have. Adding another is redundant. – Sextus Empiricus May 23 '23 at 09:22
  • "even though the two variables are still correlated, and are not independent" The problem occurs when they are fully dependent. When we can express one of the columns entirely by a linear combination of the others. You may read some about multicollinearity Why is multicollinearity different than correlation? – Sextus Empiricus May 23 '23 at 09:24
  • Forgive me (and thanks for your patience :) ), but what do you mean by "the linear term you already have"? The test data is generated from the function y = b0 + b1*x1.^2 + b2*x2.^2, which does not have any linear term (only an offset and two quadratic terms), as far as I can see? – teeeeee May 23 '23 at 09:26
  • @teeeeee the intercept and the two quadratic terms indirectly contain a linear term. You can fill in $x_2 = 0.15 + 0.45 x_1$ into the equation and write it out in terms of an intercept, $x_1$ and $x_1^2$. – Sextus Empiricus May 23 '23 at 09:30
  • Ah, I think I see now. So we would expect everything to work fine if the two variables were randomly generated (instead of linspace() ), right? (even with the extra linear term) – teeeeee May 23 '23 at 09:34
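For what it is worth, a quick sketch along those lines (MATLAB, with an arbitrary random seed chosen here for reproducibility): if x2_data is generated randomly rather than with linspace, the design matrix has full column rank and the backslash solution recovers the coefficients.

rng(1);                                 % Arbitrary seed, for reproducibility
x1_data = linspace(-1,1,20).';
x2_data = rand(20,1);                   % No longer a linear function of x1_data
A = [ ones(20,1) x1_data x2_data ];
y = -0.2 + 0.4*x1_data + 1.2*x2_data;
disp(rank(A))                           % 3: full column rank
disp((A \ y).')                         % Recovers [-0.2  0.4  1.2] up to round-off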