I was playing around with some examples to get some experience using the PolynomialFeatures transformer from Scikit-Learn, and I ran into something strange. I iteratively added higher- and higher-degree polynomial features to my regression model, and this would occasionally cause the model's r-squared value to decrease, which should not be possible.
I originally noticed this while working with the Boston Housing dataset, but here is a simple example demonstrating the issue:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
np.random.seed(1)
n = 500
x1 = np.random.uniform(0, 3, n)
x2 = np.random.uniform(0, 3, n)
x3 = np.random.uniform(0, 3, n)
y = 3 + 0.01 * x1**3 + 0.02 * x2**2 + 0.03 * x2*x3 + np.random.normal(0, 0.2, n)
X = np.vstack((x1, x2, x3)).transpose()
for d in range(1, 9):
    poly = PolynomialFeatures(d)
    Xp = poly.fit_transform(X)
    mod = LinearRegression()
    mod.fit(Xp, y)
    print('Degree', d, '- Training r-Squared:', mod.score(Xp, y))
The output of this code is:
Degree 1 - Training r-Squared: 0.2773006611069333
Degree 2 - Training r-Squared: 0.3168358821057937
Degree 3 - Training r-Squared: 0.33258321401873814
Degree 4 - Training r-Squared: 0.3160261669178669
Degree 5 - Training r-Squared: 0.3729512734983266
Degree 6 - Training r-Squared: 0.3234788901084178
Degree 7 - Training r-Squared: 0.24399386671590273
Degree 8 - Training r-Squared: 0.42981336522995917
Notice that r-squared drops on three occasions as the degree of the model increases (from degree 3 to 4, 5 to 6, and 6 to 7).
Any ideas why this is happening? Thanks in advance!
I came up with an even simpler example involving simple linear regression. Running it in Python produced the same unexpected results, while running it in R gave the expected behavior. After simply scaling the features, I was able to get the expected behavior in Python as well.
Each time we add higher-degree polynomial features, we strictly enlarge the hypothesis space. The minimum SSE over the larger hypothesis space cannot exceed the minimum SSE over the smaller one, and a reduction in SSE implies an increase in the r-squared value.
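The last step follows directly from the definition of r-squared on the training set:

```latex
R^2 = 1 - \frac{SSE}{SST}, \qquad SST = \sum_{i=1}^{n} (y_i - \bar{y})^2
```

Since SST depends only on y and is fixed across model fits, any decrease in SSE must increase R^2.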
That might not be true if sklearn were reporting a penalized metric, such as adjusted r-squared, but I checked the documentation and that is not the case.
– Beane Feb 01 '19 at 21:49