The following explanation is not limited to logistic regression; it applies equally to normal linear regression and other GLMs. By default, R excludes one level of the categorical variable, and the coefficients denote the difference of each other level from this reference class (sometimes called the baseline class). This is called dummy coding, or treatment contrasts in R (see here for an excellent overview of the different contrast options). To see the current contrasts in R, type options("contrasts"). By default, R orders the levels of the categorical variable alphabetically and takes the first as the reference class. This is not always optimal and can be changed with relevel; here, we would set the reference class to "c" in the new variable: new.variable <- relevel(old.variable, ref="c").

For each coefficient of every non-reference level of the categorical variable, a Wald test is performed to test whether the difference between the coefficient of that level and the coefficient of the reference class differs from zero. This is what the $z$- and $p$-values in the regression table are. If only one categorical class is significant, this does not imply that the whole variable is meaningless and should be removed from the model. You can check the overall effect of the variable by performing a likelihood ratio test: fit two models, one with and one without the variable, and type anova(model1, model2, test="LRT") in R (see the example below).

Here is an example:
mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
mydata$rank <- factor(mydata$rank)
my.mod <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial")
summary(my.mod)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.989979 1.139951 -3.500 0.000465 ***
gre 0.002264 0.001094 2.070 0.038465 *
gpa 0.804038 0.331819 2.423 0.015388 *
rank2 -0.675443 0.316490 -2.134 0.032829 *
rank3 -1.340204 0.345306 -3.881 0.000104 ***
rank4 -1.551464 0.417832 -3.713 0.000205 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The level rank1 has been omitted, and each rank coefficient denotes the difference between the corresponding rank level and rank1. So the difference between the coefficients of rank2 and rank1 is $-0.675$. The coefficient of rank1 is absorbed into the intercept, so on that scale the coefficient of rank2 would be $-3.99 - 0.675 = -4.67$. The Wald tests here test whether the difference between the coefficient of the reference class (here rank1) and each corresponding level differs from zero. In this case, we have evidence that the coefficients of all classes differ from the coefficient of rank1. You could also fit the model without an intercept by adding - 1 to the model formula to see all coefficients directly:
my.mod2 <- glm(admit ~ gre + gpa + rank - 1, data = mydata, family = "binomial")
summary(my.mod2) # no intercept model
Coefficients:
Estimate Std. Error z value Pr(>|z|)
gre 0.002264 0.001094 2.070 0.038465 *
gpa 0.804038 0.331819 2.423 0.015388 *
rank1 -3.989979 1.139951 -3.500 0.000465 ***
rank2 -4.665422 1.109370 -4.205 2.61e-05 ***
rank3 -5.330183 1.149538 -4.637 3.54e-06 ***
rank4 -5.541443 1.138072 -4.869 1.12e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Note that the intercept is gone now and that the coefficient of rank1 is exactly the intercept of the first model. Here, the Wald tests check not the pairwise differences between coefficients but the hypothesis that each individual coefficient is zero. Again, we have evidence that every coefficient of rank differs from zero.
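You can check this correspondence directly from the two fitted objects above (a quick sanity check):

```r
# Intercept plus dummy coefficient of the first model equals the
# corresponding coefficient of the no-intercept model:
coef(my.mod)["(Intercept)"] + coef(my.mod)["rank2"]  # about -4.665
coef(my.mod2)["rank2"]                               # about -4.665
```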
Finally, to check whether the whole variable rank improves the model fit, we fit one model with the variable rank (my.mod1) and one without it (my.mod2) and conduct a likelihood ratio test. This tests the hypothesis that all coefficients of rank are zero:
my.mod1 <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial") # with rank
my.mod2 <- glm(admit ~ gre + gpa, data = mydata, family = "binomial") # without rank
anova(my.mod1, my.mod2, test="LRT")
Analysis of Deviance Table
Model 1: admit ~ gre + gpa + rank
Model 2: admit ~ gre + gpa
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 394 458.52
2 397 480.34 -3 -21.826 7.088e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The likelihood ratio test is highly significant and we would conclude that the variable rank should remain in the model.
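The same test can be reproduced by hand from the deviances in the table above (a quick sketch; deviance is a standard accessor for glm fits):

```r
# LRT statistic: difference in residual deviances, with df equal to the
# number of rank coefficients tested (3)
dev.diff <- deviance(my.mod2) - deviance(my.mod1)  # 480.34 - 458.52 = 21.826
pchisq(dev.diff, df = 3, lower.tail = FALSE)       # about 7.09e-05, as above
```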
This post is also very interesting.
Comments:

- NULL (Feb 28 '18): admit ~ 1 vs admit ~ rank - 1?
- COOLSerdash (Mar 01 '18): Whether you fit admit ~ rank or admit ~ rank - 1, they are equivalent regarding the fit.
- NULL (Mar 03 '18): Would gre and gpa be affected by this lack of inclusion of intercept?
- COOLSerdash (Mar 03 '18): The likelihood ratio test tests the hypothesis that all coefficients of rank are zero. If you would like to perform individual Wald tests on each of the coefficients of rank, you can fit the model without intercept and look at the $p$-values. For one-sided tests, see this post here. And yes, all of the above applies to linear regression as well, as stated at the beginning of the answer.
- COOLSerdash (Mar 30 '22): gre is a continuous variable and not a factor and hence, there is no reference level for gre. R will use the first level of each factor as reference value. Here, only rank is categorical.
- COOLSerdash (Mar 30 '22): The reference level can be changed with the relevel function, e.g. relevel(rank, ref = 2). The LRT for the complete model does not change if you change the reference level.
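To make the relevel usage mentioned above concrete, here is a minimal sketch (the variable name rank.releveled is purely illustrative; with a factor, the reference is most safely given as a level name):

```r
# Make rank "2" the reference class instead of "1"; the fit and the LRT
# are unchanged, only the parameterization of the coefficients changes
mydata$rank.releveled <- relevel(mydata$rank, ref = "2")
my.mod.releveled <- glm(admit ~ gre + gpa + rank.releveled,
                        data = mydata, family = "binomial")
summary(my.mod.releveled)  # now contrasts ranks 1, 3 and 4 with rank 2
```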