
I'm trying to do cross-validation for my polynomial regressor. However, for some polynomial degrees the R^2 decreases as the degree increases (e.g., the average R^2 for degree 2 is about 0.67 while for degree 3 it is 0.28). Why is that?

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

lin_regressor = LinearRegression()

# pass the order of your polynomial here
poly = PolynomialFeatures(1)

# expand the features so they can be fed to the linear regression
X_transform = poly.fit_transform(x_train)

# fit the linear regressor
linear_regg = lin_regressor.fit(X_transform, y_train)
linear_regg.coef_

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.metrics import r2_score

crossvalidation_poly = KFold(n_splits=3, shuffle=True)

for i in range(1, 11):
    poly_cross_validation = PolynomialFeatures(degree=i)
    # transform with the degree-i expansion (not the degree-1 `poly` from above)
    X_current = poly_cross_validation.fit_transform(X_normalized)
    # cross_val_score refits a clone of the estimator on each fold,
    # so this initial fit just supplies the estimator object
    model = lin_regressor.fit(X_current, y_for_normalized)
    scores = cross_val_score(model, X_current, y_for_normalized,
                             scoring='r2', cv=crossvalidation_poly, n_jobs=1)

    print("\n\nDegree-" + str(i) + " polynomial: R^2 for every fold: " + str(np.abs(scores)))
    print('\033[1m' + "Degree-" + str(i) + " polynomial: Average R^2 for all the folds: "
          + str(np.mean(np.abs(scores))) + '\033[0m' + ", STD: " + str(np.std(scores)))


Degree-1 polynomial: R^2 for every fold: [0.41300831 0.45801624 0.17011995]
Degree-1 polynomial: Average R^2 for all the folds: 0.34704816498535956, STD: 0.2860884371794798

Degree-2 polynomial: R^2 for every fold: [0.75123033 0.85035531 0.40642591]
Degree-2 polynomial: Average R^2 for all the folds: 0.6693371814650284, STD: 0.19025980734977752

Degree-3 polynomial: R^2 for every fold: [0.30689692 0.1496736 0.38827092]
Degree-3 polynomial: Average R^2 for all the folds: 0.28161381160006743, STD: 0.23675178460286633

Degree-4 polynomial: R^2 for every fold: [0.7209975 0.40749117 0.84886534]
Degree-4 polynomial: Average R^2 for all the folds: 0.6591180032208857, STD: 0.18542670407038087

Z47
    Welcome to Cross Validated! Why do you think this behavior shouldn’t happen? – Dave Aug 11 '22 at 03:24
  • @Dave, thank you. My understanding is that if I have $n$ data points, a perfect fit of the data is a polynomial of degree $(n-1)$. As the polynomial degree increases, it'll fit the data better (yet the model becomes more complex and prone to overfitting)? But maybe you can correct my understanding... – Z47 Aug 11 '22 at 03:26
  • 1) I don’t think you reach $(n-1)$-degree polynomials, do you? $\text{//}$ 2) Why shouldn’t the $R^2$ decrease and then increase again? $\text{//}$ 3) I disagree with the sklearn implementation of out-of-sample $R^2$, and I wonder if this is an instance of their metric having unexpected behavior (though I’m not convinced that it’s so weird). – Dave Aug 11 '22 at 03:33
  • @Dave thank you. 1) No. 2) I was hoping you'll be able to shed more light on this (why?). 3) I get what you're implying (would you please elaborate more on why you don't think it's weird?) – Z47 Aug 11 '22 at 03:38
  • It will be highly insightful to get a sense of why you think this should not happen. – Dave Aug 11 '22 at 03:48
  • You mean the data points are scattered in such a way that they fit a 2nd-degree polynomial better than a 3rd-degree one? But my understanding is that as the polynomial degree increases, it'll be able to fit more points. – Z47 Aug 11 '22 at 03:53
  • That's how it works in-sample. Out-of-sample, all bets are off. – Dave Aug 11 '22 at 03:56
  • @Dave Thank you. – Z47 Aug 11 '22 at 03:59
  • If you use mean squared error as the metric instead of $R^2$, what happens? (A minimal sketch of that change appears after these comments.) Your code doesn't compile for me because NameError: name 'PolynomialFeatures' is not defined, so I can't go try it myself. – Dave Aug 11 '22 at 04:10
  • @Dave I've updated it. Thank you. – Z47 Aug 11 '22 at 04:15
  • How many data points do you have, and how many data points are there per fold? – dipetkov Aug 11 '22 at 20:16
  • @dipetkov 100 datapoints, k=3. – Z47 Aug 11 '22 at 20:39
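
A minimal sketch of the metric swap Dave suggests, reusing the loop variables from the question's code (model, X_current, y_for_normalized, crossvalidation_poly); note that sklearn reports MSE negated so that higher scores are always better:

# same CV setup as in the question, but scored with mean squared error;
# sklearn's 'neg_mean_squared_error' returns -MSE, so negate it back for display
mse_scores = cross_val_score(model, X_current, y_for_normalized,
                             scoring='neg_mean_squared_error',
                             cv=crossvalidation_poly, n_jobs=1)
print("Degree-" + str(i) + " polynomial: MSE for every fold: " + str(-mse_scores))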

2 Answers


Tough to say for sure, but I suspect the model may be overparameterized and hence generalize poorly out of sample.

This is easy to see with a small example. Here, I've generated data from a degree-3 polynomial and cross-validated over the degree passed to PolynomialFeatures. Below are the results of a 10-fold cross-validation scored on $R^2$:

[Figure: cross-validated $R^2$ as a function of polynomial degree]

As you can see, once the degree increases sufficiently, the $R^2$ declines, much like in your example. This has to do with the bias-variance trade-off: as we add more parameters to the model (a higher degree), the model becomes more variable as it loses bias (less biased insofar as it can now represent a broader class of functions). You're seeing the effects of this variability.
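
A minimal sketch of this kind of experiment (the simulated cubic's coefficients, noise level, and predictor range are made-up assumptions, not the settings actually used for the figure):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# simulate 100 noisy points from a degree-3 polynomial
x = rng.uniform(-2, 2, size=(100, 1))
y = 1 + 2 * x[:, 0] - 0.5 * x[:, 0] ** 2 + 0.8 * x[:, 0] ** 3 + rng.normal(scale=1.0, size=100)

cv = KFold(n_splits=10, shuffle=True, random_state=0)

for degree in range(1, 11):
    # the polynomial expansion is refit inside each training fold
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, x, y, scoring='r2', cv=cv)
    print(f"degree {degree}: mean cross-validated R^2 = {scores.mean():.3f}")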

  • But why would it decrease and then increase again? My rationale is that we start out with high bias and low variance; then move to some bias and some variance; then move to slightly less bias but very high variance; and then move to minuscule bias and only slightly higher variance. Does this seem reasonable to you? – Dave Aug 11 '22 at 04:57
  • Thank you for spending your time to answer this concern. It is much appreciated. So is there a way around this, or shall I just go with the polynomial that gives the highest R^2? – Z47 Aug 11 '22 at 04:58
  • Maybe using decision trees (which have low bias) will fit the data better then? – Z47 Aug 11 '22 at 05:08
  • @Dave We know about the relation between MSE and bias and variance, and it is important to note that MSE appears in the expression for $R^2$. What happens to MSE as we take the model from high bias/low variance to low bias/high variance? How might that affect the $R^2$? – Demetri Pananos Aug 11 '22 at 13:25
  • @Z47 Typically, when you do cross-validation you pick the model which optimizes your metric. So yes, pick the degree with the largest cross-validated $R^2$. Additionally, a decision tree is INCREDIBLY high variance, so I anticipate things would get worse. – Demetri Pananos Aug 11 '22 at 13:27
  • @DemetriPananos thanks a bunch. I appreciate the time you put to post your answer. Would you suggest a specific ML algorithm that might work well given the behaviour of my dataset? – Z47 Aug 11 '22 at 13:48
  • @Z47 It is impossible to recommend an approach given the information you provide. Try a bunch of models, validate them carefully, and pick the best-performing one. – Demetri Pananos Aug 11 '22 at 13:52
  • @DemetriPananos thank you. One last question: every time I run my ML model it uses new random data points for testing and training, so every run I end up with a different average R^2 value for my k-fold at every polynomial degree. So (to evaluate the model) shall I keep running the code 4-5 times and then take the average R^2 of the average k-fold R^2? – Z47 Aug 11 '22 at 13:58
  • @Z47 Fix the random state of the cross-validation; there should be documentation on doing this (a minimal sketch follows these comments). Fixing the random state ensures you do not pick a model based on fortuitous splits in the long run. – Demetri Pananos Aug 11 '22 at 14:03
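
A minimal way to fix the splits in the question's code (the random_state value itself is an arbitrary choice):

# fixing random_state makes the 3-fold splits identical across runs,
# so the per-degree average R^2 values are reproducible
crossvalidation_poly = KFold(n_splits=3, shuffle=True, random_state=42)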

You don't provide a minimal reproducible example, so there is room for one more guess about the issue you report.

In a comment you say that you are working with 100 data points. So when you do 3-fold cross-validation, you end up training the model on 66, 67 and 67 points and evaluating it on the remaining 34, 33 and 33 points.

Obviously that's not much data to train a model on. The polynomial features make it even more challenging because polynomials are bad at extrapolating outside of the observed range.

So my stab at what's happening: (a) your data set is small, so the in-fold and out-of-fold ranges of the predictor(s) end up being "different enough"; (b) you use polynomials, which are bad at extrapolating outside the range of the predictor(s) observed during training; and (c) you compound the problem with the intrinsic variability/instability of high-degree polynomials by using a different 3-fold split for each polynomial degree.

PS: There is another issue with how you do cross-validation: you normalize the entire data set first, using all 100 points, before splitting. However, the normalization step (I assume this is mean-0, variance-1 scaling?) is part of the modeling pipeline, so it should be cross-validated as well; a sketch of one way to do this follows.
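
A sketch of one way to do this with scikit-learn, assuming the normalization is StandardScaler-style scaling; X_raw is a hypothetical name for the unnormalized predictors, since the question only shows the normalized array:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

cv = KFold(n_splits=3, shuffle=True, random_state=0)

for degree in range(1, 11):
    # the scaler and the polynomial expansion are refit on each fold's
    # training portion only, so nothing leaks from the held-out fold
    model = make_pipeline(StandardScaler(), PolynomialFeatures(degree),
                          LinearRegression())
    scores = cross_val_score(model, X_raw, y_for_normalized, scoring='r2', cv=cv)
    print(f"degree {degree}: mean cross-validated R^2 = {scores.mean():.3f}")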

dipetkov
  • Thanks a bunch. I thought about reasons (a) and (c); reason (b), though, never came to my mind. Thank you for providing such an informative answer! – Z47 Aug 11 '22 at 22:13