I was playing around with some examples to get some experience using the PolynomialFeatures transformer from Scikit-Learn, and I ran into something strange. I iteratively added higher- and higher-degree polynomial features to my regression model, and this would occasionally cause the model's r-squared value to decrease, which should not be possible.
I originally noticed this while working with the Boston Housing dataset, but here is a simple example demonstrating the issue:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
np.random.seed(1)
n = 500
x1 = np.random.uniform(0, 3, n)
x2 = np.random.uniform(0, 3, n)
x3 = np.random.uniform(0, 3, n)
y = 3 + 0.01 * x1**3 + 0.02 * x2**2 + 0.03 * x2*x3 + np.random.normal(0, 0.2, n)
X = np.vstack((x1, x2, x3)).transpose()
for d in range(1, 9):
    poly = PolynomialFeatures(d)
    Xp = poly.fit_transform(X)
    mod = LinearRegression()
    mod.fit(Xp, y)
    print('Degree', d, '- Training r-Squared:', mod.score(Xp, y))
The output of this code is:
Degree 1 - Training r-Squared: 0.2773006611069333
Degree 2 - Training r-Squared: 0.3168358821057937
Degree 3 - Training r-Squared: 0.33258321401873814
Degree 4 - Training r-Squared: 0.3160261669178669
Degree 5 - Training r-Squared: 0.3729512734983266
Degree 6 - Training r-Squared: 0.3234788901084178
Degree 7 - Training r-Squared: 0.24399386671590273
Degree 8 - Training r-Squared: 0.42981336522995917
Notice that r-squared drops on three occasions as the degree of the model increases (from degree 3 to 4, 5 to 6, and 6 to 7).
Any ideas why this is happening? Thanks in advance!
I came up with an even simpler example involving simple linear regression. Running it in Python produced the same unexpected results, while running it in R gave the expected behavior. After simply scaling the features, I was able to get the expected behavior in Python as well.
Each time we add higher-degree polynomial features, we strictly enlarge the hypothesis space. The minimum SSE over the larger hypothesis space cannot exceed the minimum SSE over the smaller one, and a reduction in SSE implies an increase in the r-squared value.
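The last step follows directly from the definition of r-squared on the training set:

```latex
R^2 = 1 - \frac{SSE}{SST}, \qquad SST = \sum_{i=1}^{n} (y_i - \bar{y})^2
```

Since SST depends only on y and is fixed across model fits, any decrease in SSE must increase R^2.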
That might not be true if sklearn were reporting a penalized metric, such as adjusted r-squared, but I checked the documentation and that is not the case.
– Beane Feb 01 '19 at 21:49