8

I have a multinomial response variable and a multinomial "independent" variable. Is there an equivalent statistic or method for calculating the variance explained by the independent variable?

godspeed
  • 493
  • 3
  • 10
  • 1
    What regression method are you using? – whuber May 31 '16 at 18:27
  • 1
    I'm using survey-weighted logistic regression model (if you use R, the function is svyglm with quasibinomial family) -- I recoded my response variable as binary. But I also have this challenge with a "regular" model-based logistic regression problem – godspeed Jun 01 '16 at 02:47
  • I'm looking for an answer for this too! I have a set of models using both categorical and numeric inputs. I can report an R2 as a crude indicator of the fraction of variation "explained" by my numeric predictors, but how can I report a similar metric for my categorical predictors? – ErichBSchulz Apr 16 '22 at 04:10

4 Answers

5

For generalised linear models (GLMs) we use deviance values as a generalisation of the scaled sums-of-squares used in regression (see related answer here). Instead of the coefficient of determination used in linear regression, we would use McFadden's $R^2$ value, which is given by:

$$R^2_\text{GLM} = \frac{\hat{\ell}_p - \hat{\ell}_0}{\hat{\ell}_S - \hat{\ell}_0},$$

where $\hat{\ell}_S$ is the maximised log-likelihood under the saturated model (one coefficient per data point), $\hat{\ell}_p$ is the maximised log-likelihood under the actual model, and $\hat{\ell}_0$ is the maximised log-likelihood under the null model (intercept term only).

This goodness-of-fit quantity measures the proportion of the deviance beyond the null model that is explained by the explanatory variables in the actual model. It is a generalisation of the coefficient of determination in the Gaussian linear regression model (i.e., it reduces down to that statistic in that model).

Ben
  • 124,856
  • 2
    At least for the bounty, I think that it would be important to mention that using deviance this way (or doing something with Brier score to be more aligned with square loss in OLS $R^2$), does not depend on the features being of any particular type. – Dave Apr 16 '22 at 12:05
  • 1
    Subscript edit: $\hat{\ell}_0$ is the maximised log-likelihood under the null model (intercept term only). CV considers edits under 6 characters invalid, so I'm posting here. – krkeane Apr 17 '22 at 12:47
  • @krkeane: Thanks --- edited. – Ben Apr 18 '22 at 02:03
  • I like log-likelihood based measures. There are pseudo $R^2$ measures I like more than McFadden's: https://hbiostat.org/bib/r2.html – Frank Harrell Apr 18 '22 at 11:19
1

If you’re comfortable using $R^2$ as a crude approximation for the proportion of variance explained, knowing that this interpretation does not strictly hold in nonlinear models like logistic regression (but you decide how useful it is for your work as an easy-to-compute estimate that your audience thinks it understands), then it doesn’t matter whether you’re using categorical or continuous features. The decomposition of the total sum of squares that leads to $R^2$, which is given in the link, never explicitly mentions the features (yes, they’re in there implicitly through the predictions, $\hat y_i$), only the observations and predicted values.
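To make that point concrete, here is a minimal sketch (my own illustration, not from the answer) showing that $R^2$ is computed from observations and predictions alone; the feature types never enter the calculation:

```python
import numpy as np

def r_squared(y, y_hat):
    '''R^2 from the decomposition of the total sum of squares:
    1 - SS_residual / SS_total. Only y and y_hat appear.'''
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# works identically whether the predictions came from categorical
# or continuous features
y = [1, 0, 1, 1, 0]
y_hat = [0.8, 0.2, 0.7, 0.9, 0.1]
print(round(r_squared(y, y_hat), 4))  # 0.8417
```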

I actually think the bounty comments ask a different question than the original post. Since the bounty message will go away in a few days, I’ll include the text below, as I believe I am addressing the bounty question more than the original question. Additionally, the bounty asks for a reputable source. I’d expect most regression textbooks to give the decomposition of the total sum of squares to explain how $R^2$ winds up giving the proportion of variance explained (at least in OLS linear regression with an intercept). My professor used Agresti’s book, so that’s what I know. What you’re looking for is in chapter 2, pages 47-54.

Agresti, Alan. Foundations of linear and generalized linear models. John Wiley & Sons, 2015.

BOUNTY

I can report an R2 as a crude indicator of the fraction of variation "explained" by numeric predictors, but how can I report a similar metric for categorical predictors? It feels like there ought to be a simple expression for this but I'm struggling to find a reference that expresses it!

Dave
  • 62,186
  • LOL - as I spent my rep on the bounty for this question I am now unable to comment on the answers (because reputation < 50!), so apologies for replying via an "answer". Dave, thank you for your answer and the link to your related answer, which was very helpful. I wish I could have split the bounty. Ultimately I had to pick one, and Vasilis's link to a detailed review was great. – ErichBSchulz Apr 18 '22 at 01:35
  • and thanks to the admin that gave me a few points and moved my comment to the correct field :-) – ErichBSchulz Apr 18 '22 at 07:19
1

Given that you only have one dependent and one independent variable, you can perform correlation tests tailored for categorical variables and report the correlation statistics (and their p-values). This article is a good overview of your options.

That said, you can alternatively perform cross validation and report the accuracy metrics from there. However, although this is a straightforward and more "universal" solution, be aware that it is a biased estimator of the variance.
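As a sketch of one such correlation test for two categorical variables (a chi-square test with Cramér's V, assuming `scipy`; the contingency table below is hypothetical):

```python
import numpy as np
from scipy.stats import chi2_contingency

# hypothetical 2x3 contingency table: rows = response, columns = predictor
table = np.array([[30, 10, 20],
                  [10, 30, 20]])

chi2, p_value, dof, expected = chi2_contingency(table)

# Cramér's V rescales chi-square to a 0-1 association measure
n = table.sum()
k = min(table.shape) - 1
cramers_v = np.sqrt(chi2 / (n * k))
```

A small p-value together with the magnitude of V gives both a significance test and an effect-size-style summary of the association.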

1

Based on Dave's and Vasilis's responses (I'm sorry Ben, my non-statistical brain could not quite absorb your answer - which is on me) I wrote and roughly validated a Python function to produce a grid of "R2 equivalents".

It worries me a bit that I'm needing to write code to do what I would have thought was a common task. Thus I'm posting my code (a) so others with a similar need may use it and (b) for review by anybody with both experience and time to provide feedback!

import pandas as pd

def explain_variation_by_category_grid(data_in, cat_cols, cont_cols, dropna=True):
    '''
    Generate a grid of quasi R2 values for the approximate fraction of squared
    difference in the continuous variable associated with each category.
    Influenced by discussion here:
    https://stats.stackexchange.com/questions/215606/variance-explained-equivalent-statistics-for-categorical-data

    Parameters:
        data_in (pandas DataFrame): input data
        cat_cols (list): category columns
        cont_cols (list): continuous columns
        dropna (bool): exclude NA values

    Returns:
        quasi R2 (DataFrame): indexed on the continuous variables, with one
        column for each category
    '''

    def sum_of_var_sqd(a):
        '''calculate sum of squared differences from the group mean'''
        mean = a.mean()
        return ((a - mean) ** 2).sum()

    res_dict = {}
    for split in cat_cols:
        res_dict[split] = {}
        for var in cont_cols:
            f = [split, var]
            data = data_in[f].dropna() if dropna else data_in[f]
            all_sum_of_var_sqd = sum_of_var_sqd(data[data.columns[1]])
            agg = pd.pivot_table(data, index=split, aggfunc=[sum_of_var_sqd])
            cat_sum_of_var_sqd = agg[agg.columns[0]].sum()
            r2 = 1 - (cat_sum_of_var_sqd / all_sum_of_var_sqd)
            res_dict[split][var] = r2
    return pd.DataFrame.from_dict(res_dict)
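As a rough sanity check of the grouped sum-of-squares idea the function implements (a toy example I constructed; this is essentially a one-way eta-squared):

```python
import pandas as pd

# hypothetical data: two groups with well-separated values
df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [1.0, 2.0, 5.0, 6.0],
})

# total sum of squares around the overall mean
total_ss = ((df["value"] - df["value"].mean()) ** 2).sum()

# within-group sum of squares around each group mean
within_ss = df.groupby("group")["value"].apply(
    lambda a: ((a - a.mean()) ** 2).sum()
).sum()

# fraction of variation "explained" by group membership
quasi_r2 = 1 - within_ss / total_ss  # 16/17 here, since the groups separate well
```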

My test data consists of about 150k records, which I've used to train neural network (NN) and boosted forest (BF) models, generate out-of-sample predictions (*_pred), and calculate residuals (*_act - *_pred = *_d) for two outcomes of interest (Recovery time = ~minutes to wake up after surgery; DAOH30 = "days alive and out of hospital").

As you can see, the categorical predictors are all weakly associated with the actual values, but the models (using the data they had) all made predictions more strongly associated with the predictors. The residuals had values close to zero, which is what I would hope for, given the expectation that the model would be cancelling out the impact of the variables.

table showing quasi r2 values