
I want to evaluate the organization based on the number of satisfied customers, but the R^2 is negative. This is the original data:

        SECTOR  price   profit  INSPECTION  licenses  PR     CS(S)   CS(NOT S)  A(s)    A(nonS)
    0   A       3809    1643    6834.0      499.0     4053   203.0   45.0       NaN     NaN
    1   B       18608   16270   6828.0      2815.0    10923  35.0    5.0        1980.0  200
    2   C       3814    1861    2375.0      509.0     2107   99.0    43.0       NaN     NaN
    3   A       15869   20293   2595.0      2206.0    5285   30.0    5.0        1150.0  NaN
    4   B       5663    1881    3629.0      734.0     5667   220.0   55.0       NaN     565.0

A(s) stands for the number of satisfied customers across the whole set of sectors, meaning a value such as 200 covers the services provided by sectors A, B, and C together.

I focused on sector B and whether it affects A(s) or not.

I converted Sector to dummy variables and then deleted the A and C columns. This is what I have now:

    df1.corr()

                             price     profit     INSPECTION  licenses   PR         A(s)
    CLEARANCE            1.000000  0.376304   0.211653  -0.044924   0.397780   0.389236
    PERMITS              0.376304  1.000000  -0.021812  -0.158237   0.089504   0.373245
    INSPECTION           0.211653 -0.021812   1.000000   0.573478   0.438797   0.245204
    Facilities licenses -0.044924 -0.158237   0.573478   1.000000   0.050931  -0.164353
    PR                   0.397780  0.089504   0.438797   0.050931   1.000000   0.497360

    import numpy as np
    from sklearn.model_selection import train_test_split

    x = np.array(df1.drop(['A(s)'], axis=1))
    y = df1['A(s)'].values
    X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=1/3, random_state=0)

    from sklearn.linear_model import LinearRegression

    regressor = LinearRegression()
    regressor.fit(X_train, y_train)

Predicting the test set results:

    y_pred = regressor.predict(X_test)

    from sklearn.metrics import r2_score

    r2_score(y_test, y_pred)  # r2_score expects (y_true, y_pred) in that order

The result is R^2 = -0.6052320362843366. I do not know why the sign is negative. Please help, and thank you.

kjnk

2 Answers

2

It means that your sum of squared residuals is greater than the sum of squared residuals of a model that always predicts the out-of-sample mean. This can be regarded as a baseline, “must beat” model. That you cannot achieve stronger performance than this baseline model means that your model is not doing a good job of predicting. While this might seem disappointing, you do out-of-sample testing to catch when you have such a model, so not all is lost.
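As a minimal numeric sketch (values invented for illustration), this is exactly what $R^2 = 1 - SS_{res}/SS_{tot}$ measures: a model whose squared errors exceed those of the always-predict-the-mean baseline scores below zero.

```python
import numpy as np
from sklearn.metrics import r2_score

y_test = np.array([10.0, 12.0, 9.0, 14.0])
y_pred = np.array([20.0, 3.0, 18.0, 2.0])       # a model predicting badly
baseline = np.full_like(y_test, y_test.mean())  # "must beat" model: always predict the mean

ss_res = np.sum((y_test - y_pred) ** 2)         # model's sum of squared residuals
ss_tot = np.sum((y_test - baseline) ** 2)       # baseline's sum of squared residuals

r2_manual = 1 - ss_res / ss_tot                 # negative because ss_res > ss_tot
print(r2_manual, r2_score(y_test, y_pred))      # the two calculations agree
```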

I dislike the sklearn implementation of out-of-sample $R^2$ and find it to lack motivation. However, I would expect your training and testing means to be similar, so the $R^2$ you’ve reported is unlikely to differ much from what I would get from my preferred calculation.
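One alternative calculation (a sketch, not necessarily the exact formula Dave has in mind) benchmarks predictions against the training mean rather than the test mean, since only the training mean is available at prediction time; when the two means are similar, the two versions give similar answers.

```python
import numpy as np
from sklearn.metrics import r2_score

# made-up values purely for illustration
y_train = np.array([5.0, 7.0, 6.0, 8.0])
y_test = np.array([6.0, 9.0, 7.0])
y_pred = np.array([7.0, 6.0, 8.0])

# sklearn's r2_score benchmarks predictions against the *test* mean
r2_sklearn = r2_score(y_test, y_pred)

# out-of-sample variant: benchmark against the *training* mean instead
ss_res = np.sum((y_test - y_pred) ** 2)
ss_base = np.sum((y_test - y_train.mean()) ** 2)
r2_oos = 1 - ss_res / ss_base
```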

Note that a linear regression fitted by ordinary least squares is guaranteed to have a non-negative in-sample $R^2$ as long as the model contains an intercept. Deviating from that situation removes the guarantee, as you have seen from your result: using a nonlinear model, estimating a linear model by a method other than minimizing the sum of squared residuals, excluding the intercept, or measuring $R^2$ out of sample can each produce a negative value.
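That guarantee can be seen on synthetic data (assumed names and values, not the question's dataset): the same OLS fit that is non-negative in-sample with an intercept can fall far below zero once the intercept is dropped.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 1))
y = 100 + 0.1 * X[:, 0] + rng.normal(size=50)  # large offset, weak signal

with_intercept = LinearRegression().fit(X, y)
no_intercept = LinearRegression(fit_intercept=False).fit(X, y)

r2_with = r2_score(y, with_intercept.predict(X))   # in-sample with intercept: >= 0
r2_without = r2_score(y, no_intercept.predict(X))  # forced through the origin: far below 0
```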

Dave
  • 62,186
  • can you please tell me how to fix this problem and should i use another model? – kjnk Jan 11 '23 at 06:18
  • It’s a matter of the usual strategies for predictive modeling. Beyond that, your question is essentially asking how to do machine learning. Do you have a more specific question? If so, please consider posting a new question to ask. – Dave Jan 11 '23 at 06:31
-1

This looks like a bug.

If you get a negative R^2 on your test data with linear regression, it means (IIRC) that your test data has a different mean from your training data (or that you are lacking an intercept).

sklearn defaults to fitting an intercept.

Since you are doing a train test split and nothing else, you would not expect the data sets to differ.

I suspect that changing the random state to a non-zero number might help. Maybe 0 means that it doesn't shuffle the data, and your data is ordered somehow?
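That last guess is easy to check on a toy array (not the question's data): in sklearn, `random_state` only seeds the shuffle, and shuffling itself is controlled by the separate `shuffle` flag, which defaults to `True`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(12)

# random_state=0 is an ordinary seed; the split is still shuffled
train_seed0, _ = train_test_split(data, test_size=1/3, random_state=0)

# only shuffle=False preserves the original order
train_ordered, _ = train_test_split(data, test_size=1/3, shuffle=False)
print(train_ordered)  # [0 1 2 3 4 5 6 7]
```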

seanv507
  • 6,743
  • 1
    Sorry, this is not correct. It is fine to get $R^2<0$ on test data. That does not indicate a bug, just a poor model. This applies whether the out-of-sample data have the same mean as the in-sample data or not. // Just splitting the data randomly would cause the means of the train and test sets to be equal in expected value, yes, but not necessarily equal in any given split. (In fact, it would be quite surprising for the observed train and test means to be exactly equal.) – Dave Jan 11 '23 at 08:01
  • yes it is a "poor" model, because it cannot even fit the mean of the test data. assuming that the data is randomly split, this should not happen. which is why I suggest a bug. – seanv507 Jan 11 '23 at 08:09
  • What do you mean by fitting the mean of the test data, and why do you have to be able to do this? – Dave Jan 11 '23 at 08:19