
I want to evaluate the organization based on the number of satisfied customers, but the R^2 is negative. This is the original data:

        SECTOR  price   profit  INSPECTION  licenses  PR     CS(S)   CS(NOT S)  A(s)    A(nonS)
    0   A       3809    1643    6834.0      499.0     4053   203.0   45.0       NaN     NaN
    1   B       18608   16270   6828.0      2815.0    10923  35.0    5.0        1980.0  200
    2   C       3814    1861    2375.0      509.0     2107   99.0    43.0       NaN     NaN
    3   A       15869   20293   2595.0      2206.0    5285   30.0    5.0        1150.0  NaN
    4   B       5663    1881    3629.0      734.0     5667   220.0   55.0       NaN     565.0

A(s) stands for the number of satisfied customers across the whole set of sectors, meaning a value such as 200 covers the services provided by sectors A, B, and C together.

I focused on sector B and whether it affects A(s) or not.

I converted Sector to dummy variables and then deleted the A and C columns. This is what I have now:

    df1.corr()

                             price     profit     INSPECTION  licenses   PR         A(s)
    CLEARANCE            1.000000  0.376304   0.211653  -0.044924   0.397780   0.389236
    PERMITS              0.376304  1.000000  -0.021812  -0.158237   0.089504   0.373245
    INSPECTION           0.211653 -0.021812   1.000000   0.573478   0.438797   0.245204
    Facilities licenses -0.044924 -0.158237   0.573478   1.000000   0.050931  -0.164353
    PR                   0.397780  0.089504   0.438797   0.050931   1.000000   0.497360

    import numpy as np
    from sklearn.model_selection import train_test_split

    x = np.array(df1.drop(['A(s)'], axis=1))
    y = df1['A(s)'].values
    X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=1/3, random_state=0)

    from sklearn.linear_model import LinearRegression

    regressor = LinearRegression()
    regressor.fit(X_train, y_train)

Predicting the test set results:

    y_pred = regressor.predict(X_test)

    from sklearn.metrics import r2_score

    r2_score(y_test, y_pred)  # r2_score expects (y_true, y_pred) in that order

The result is R^2 = -0.6052320362843366. I do not know why the sign is negative. Please help, and thank you.

kjnk

2 Answers

2

It means that your sum of squared residuals is greater than the sum of squared residuals of a model that always predicts the out-of-sample mean. This can be regarded as a baseline, “must beat” model. That you cannot achieve stronger performance than this baseline model means that your model is not doing a good job of predicting. While this might seem disappointing, you do out-of-sample testing to catch when you have such a model, so not all is lost.
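As a minimal numeric sketch (values invented for illustration), this is exactly what $R^2 = 1 - SS_{res}/SS_{tot}$ measures: a model whose squared errors exceed those of the always-predict-the-mean baseline scores below zero.

```python
import numpy as np
from sklearn.metrics import r2_score

y_test = np.array([10.0, 12.0, 9.0, 14.0])
y_pred = np.array([20.0, 3.0, 18.0, 2.0])       # a model predicting badly
baseline = np.full_like(y_test, y_test.mean())  # "must beat" model: always predict the mean

ss_res = np.sum((y_test - y_pred) ** 2)         # model's sum of squared residuals
ss_tot = np.sum((y_test - baseline) ** 2)       # baseline's sum of squared residuals

r2_manual = 1 - ss_res / ss_tot                 # negative because ss_res > ss_tot
print(r2_manual, r2_score(y_test, y_pred))      # the two calculations agree
```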

I dislike the sklearn implementation of out-of-sample $R^2$ and find it to lack motivation. However, I would expect your training and testing means to be similar, so the $R^2$ you’ve reported is unlikely to differ much from what I would get from my preferred calculation.
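One alternative calculation (a sketch, not necessarily the exact formula Dave has in mind) benchmarks predictions against the training mean rather than the test mean, since only the training mean is available at prediction time; when the two means are similar, the two versions give similar answers.

```python
import numpy as np
from sklearn.metrics import r2_score

# made-up values purely for illustration
y_train = np.array([5.0, 7.0, 6.0, 8.0])
y_test = np.array([6.0, 9.0, 7.0])
y_pred = np.array([7.0, 6.0, 8.0])

# sklearn's r2_score benchmarks predictions against the *test* mean
r2_sklearn = r2_score(y_test, y_pred)

# out-of-sample variant: benchmark against the *training* mean instead
ss_res = np.sum((y_test - y_pred) ** 2)
ss_base = np.sum((y_test - y_train.mean()) ** 2)
r2_oos = 1 - ss_res / ss_base
```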

Note that a linear regression fitted by ordinary least squares is guaranteed to have a non-negative in-sample $R^2$ as long as the model contains an intercept. Deviating from that situation removes the guarantee, as you have seen from your result: using a nonlinear model, estimating a linear model by a method other than minimizing the sum of squared residuals, excluding the intercept, or measuring $R^2$ out of sample can each produce a negative value.
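That guarantee can be seen on synthetic data (assumed names and values, not the question's dataset): the same OLS fit that is non-negative in-sample with an intercept can fall far below zero once the intercept is dropped.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 1))
y = 100 + 0.1 * X[:, 0] + rng.normal(size=50)  # large offset, weak signal

with_intercept = LinearRegression().fit(X, y)
no_intercept = LinearRegression(fit_intercept=False).fit(X, y)

r2_with = r2_score(y, with_intercept.predict(X))   # in-sample with intercept: >= 0
r2_without = r2_score(y, no_intercept.predict(X))  # forced through the origin: far below 0
```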

Dave
  • 62,186
  • can you please tell me how to fix this problem and should i use another model? – kjnk Jan 11 '23 at 06:18
  • It’s a matter of the usual strategies for predictive modeling. Beyond that, your question is essentially asking how to do machine learning. Do you have a more specific question? If so, please consider posting a new question to ask. – Dave Jan 11 '23 at 06:31
-1

This looks like a bug.

If you get a negative R^2 on your test data with linear regression, it means (IIRC) that your test data has a different mean from your training data (or that you are lacking an intercept).

sklearn defaults to fitting an intercept.

Since you are doing a train test split and nothing else, you would not expect the data sets to differ.

I suspect that changing the random state to a non-zero number might help. Maybe 0 means that it doesn't shuffle the data, and your data is ordered somehow?
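That last guess is easy to check on a toy array (not the question's data): in sklearn, `random_state` only seeds the shuffle, and shuffling itself is controlled by the separate `shuffle` flag, which defaults to `True`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(12)

# random_state=0 is an ordinary seed; the split is still shuffled
train_seed0, _ = train_test_split(data, test_size=1/3, random_state=0)

# only shuffle=False preserves the original order
train_ordered, _ = train_test_split(data, test_size=1/3, shuffle=False)
print(train_ordered)  # [0 1 2 3 4 5 6 7]
```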

seanv507
  • 6,743
  • 1
    Sorry, this is not correct. It is fine to get $R^2<0$ on test data. That does not indicate a bug, just a poor model. This applies whether the out-of-sample data have the same mean as the in-sample data or not. // Just splitting the data randomly would cause the means of the train and test sets to be equal in expected value, yes, but not necessarily equal in any given split. (In fact, it would be quite surprising for the observed train and test means to be exactly equal.) – Dave Jan 11 '23 at 08:01
  • yes it is a "poor" model, because it cannot even fit the mean of the test data. assuming that the data is randomly split, this should not happen. which is why I suggest a bug. – seanv507 Jan 11 '23 at 08:09
  • What do you mean by fitting the mean of the test data, and why do you have to be able to do this? – Dave Jan 11 '23 at 08:19