How do we assess goodness of fit in a Generalized Linear Model (GLM), since R-squared is not reported? For example, the following are the results of a regression on the iris dataset using statsmodels, with the code: smf.glm('SL~Species+SW+PL', data=irisdf, family=sm.families.Gaussian(sm.families.links.log)).fit()

                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                     SL   No. Observations:                  150
Model:                            GLM   Df Residuals:                      145
Model Family:                Gaussian   Df Model:                            4
Link Function:                    log   Scale:                        0.096285
Method:                          IRLS   Log-Likelihood:                -34.807
Date:                Fri, 17 Jul 2020   Deviance:                       13.961
Time:                        12:44:23   Pearson chi2:                     14.0
No. Iterations:                     6                                         
Covariance Type:            nonrobust                                         
=========================================================================================
                            coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------------
Intercept                 1.1842      0.046     26.024      0.000       1.095       1.273
Species[T.versicolor]    -0.1335      0.035     -3.772      0.000      -0.203      -0.064
Species[T.virginica]     -0.2046      0.046     -4.405      0.000      -0.296      -0.114
SW                        0.0713      0.014      5.118      0.000       0.044       0.099
PL                        0.1244      0.010     12.214      0.000       0.104       0.144
=========================================================================================

==================== Summary2() ====================
               Results: Generalized linear model
=====================================================================
Model:              GLM              AIC:            79.6145
Link Function:      log              BIC:            -712.5809
Dependent Variable: SL               Log-Likelihood: -34.807
Date:               2020-07-17 12:44 LL-Null:        -492.86
No. Observations:   150              Deviance:       13.961
Df Model:           4                Pearson chi2:   14.0
Df Residuals:       145              Scale:          0.096285
Method:             IRLS
---------------------------------------------------------------------
                       Coef.  Std.Err.     z     P>|z|   [0.025  0.975]
---------------------------------------------------------------------
Intercept              1.1842   0.0455  26.0241  0.0000  1.0950  1.2734
Species[T.versicolor] -0.1335   0.0354  -3.7717  0.0002 -0.2028 -0.0641
Species[T.virginica]  -0.2046   0.0464  -4.4051  0.0000 -0.2956 -0.1136
SW                     0.0713   0.0139   5.1183  0.0000  0.0440  0.0986
PL                     0.1244   0.0102  12.2141  0.0000  0.1044  0.1443
=====================================================================

What is the equivalent of R-squared in the above analysis?

rnso

1 Answer

First, the answer: you can calculate R2 for your model by hand. Sometimes statsmodels provides a pseudo R2 as well:

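import numpy as np

# y: observed response (e.g. irisdf['SL']); your_model: the fitted GLM results
sst = np.sum((y - np.mean(y)) ** 2)           # total sum of squares
sse = np.sum(your_model.resid_response ** 2)  # residual sum of squares
r2 = 1.0 - sse / sst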
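For a Gaussian family, the deviance is the residual sum of squares and the null deviance is the total sum of squares around the mean, so the same ratio can also be formed from those two numbers directly. Here is a minimal sketch, assuming the two inputs come from the fitted statsmodels result (`deviance` and `null_deviance` are attributes of `GLMResults`). The helper name `deviance_r2` and the toy numbers are made up for illustration and are not from the iris fit:

```python
def deviance_r2(deviance, null_deviance):
    """Return 1 - deviance / null_deviance (a deviance-based R2)."""
    if null_deviance <= 0:
        raise ValueError("null_deviance must be positive")
    return 1.0 - deviance / null_deviance


# Toy data (hypothetical, not the iris model): observed values and fitted means.
y = [1.0, 2.0, 3.0, 4.0]
fitted = [1.1, 1.9, 3.2, 3.8]

y_mean = sum(y) / len(y)
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # Gaussian deviance
sst = sum((yi - y_mean) ** 2 for yi in y)               # Gaussian null deviance

toy_r2 = deviance_r2(sse, sst)
print(round(toy_r2, 4))
```

With a fitted model this would presumably be called as `deviance_r2(your_model.deviance, your_model.null_deviance)`. For non-Gaussian families the result is a deviance-based pseudo R2 rather than the classical SSE/SST version.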

That said, I do not think R2 is the best way to assess your regression model in this case. Why not use AIC? There is much discussion about whether R2 really is the 'gold standard' for assessing a regression. An interesting post about common misunderstandings in statistics lists this one about R2: "Equating a high R2 with a "good model" (or equivalently, lamenting - or, in the case of referees of papers, criticizing - that R2 is "too" low)."

Maybe look into the discussions here and here, and reconsider how you report your model's performance.

Thomas
  • Specifically, what would you say is the performance of the example model that I have given in my question above? I would like to read it off the output shown there rather than doing any more calculations. – rnso Jul 17 '20 at 09:52
  • Also, just to be clear, it is 1.0 - (sse/sst) and not (1.0 - sse)/sst? – rnso Jul 17 '20 at 09:54
  • If I had to assess the performance of your model, I would definitely calculate the RMSE, as it tells you more about the average error the model makes when predicting the outcome for an observation. Just look at the residuals between observation and prediction. And yes, it is SSE / SST, which you then subtract from 1, to my knowledge. – Thomas Jul 17 '20 at 11:21
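
For what it's worth, the RMSE mentioned above can be read off the printed summary. With a Gaussian family the reported deviance is the sum of squared response residuals, so dividing it by the number of observations and taking the square root gives the RMSE. A rough sketch using only the figures from the output in the question (the function name `rmse_from_deviance` is made up here):

```python
import math


def rmse_from_deviance(deviance, n_obs):
    """RMSE for a Gaussian-family fit, where the deviance equals the SSE."""
    return math.sqrt(deviance / n_obs)


# Figures copied from the summary in the question.
deviance = 13.961  # "Deviance"
n_obs = 150        # "No. Observations"

rmse = rmse_from_deviance(deviance, n_obs)
print(round(rmse, 3))  # 0.305
```

The result is in the units of the response SL. Taking the square root of the reported Scale (0.096285) gives a similar figure, about 0.310, because Scale divides by the residual degrees of freedom (145) instead of the number of observations.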