The problem is that you have several misunderstandings about logistic regression and the representation of categorical data in regression models. First, we don't generally think of there being an error term in logistic regression$^1$, so the "$e$" in your description of your model should not be there. (Also, since the predictor is a multi-level categorical variable, $\bf X$ will have to be a row vector and the coefficients $\boldsymbol\beta$ a column vector by which it is post-multiplied.) Thus, your model should be written: ${\rm logit}(Y=1|{\bf X}) = {\bf X}\boldsymbol\beta$.
Next, we need to be clear on what the logit is. Specifically, it is a transformation:
$$
{\rm logit}(y_i = 1) = \ln\bigg(\frac{p_i}{1-p_i}\bigg)
$$
Here $p_i$ is the probability that $y_i = 1$, the fraction within the parentheses is the odds$^2$ that $y_i = 1$, and $\ln$ is the natural logarithm. Thus, ${\rm logit}(y_i = 1)$ is the log odds that $y_i = 1$. When the right hand side in your model is $0$, the log odds is $0$, so the probability is $.5$, not $0$.
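In R, the logit and its inverse are available as qlogis() and plogis() (the quantile and distribution functions of the standard logistic distribution), so it is easy to check that a log odds of $0$ corresponds to a probability of $.5$:
p = .5
log(p/(1-p))   # the logit 'by hand': 0
qlogis(.5)     # the same, via base R: 0
plogis(0)      # the inverse logit of 0: back to .5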
Lastly, when you have a categorical variable with multiple (i.e., $>2$) levels, care must be taken to represent the variable appropriately. There are a number of ways such variables can be represented; the most common is called "reference level coding" (RLC; this is the default in R). Another way is called "level means coding"$^3$ (LMC). In R, if you suppress the intercept when you have a multi-level factor (y ~ x + 0), you will get level means coding by default. Let's consider a simple case where there is a factor with three levels, with 10 observations within each level, and where the observed probabilities of success are .2, .5, .8 (meaning that the logits will be: -1.386, 0, 1.386).
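You can verify those logits directly with base R's logit function:
qlogis(c(.2, .5, .8))   # -1.386294  0.000000  1.386294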
With RLC$^4$, your design matrix will have a column of $1$s on the left to represent the intercept, which is always present, but there will be no column for x1:
\begin{array}{rrrr}
{\rm obs.} &{\rm (Intercept)} &x2 &x3 \\
1 &1 &0 &0 \\
\vdots &\vdots &\vdots &\vdots \\
10 &1 &0 &0 \\
11 &1 &1 &0 \\
\vdots &\vdots &\vdots &\vdots \\
20 &1 &1 &0 \\
21 &1 &0 &1 \\
\vdots &\vdots &\vdots &\vdots \\
30 &1 &0 &1
\end{array}
The result of this strange-seeming arrangement is that the (Intercept) will represent x1 (this is the "reference level"), while x2 and x3 will represent the differences between the estimates for these levels and the reference level in your output.
If we use LMC, there won't be an intercept column of nothing but $1$s; instead, you will see the previously missing $x1$ column, which indicates which observations were in the first level of the factor:
\begin{array}{rrrr}
{\rm obs.} &x1 &x2 &x3 \\
1 &1 &0 &0 \\
\vdots &\vdots &\vdots &\vdots \\
10 &1 &0 &0 \\
11 &0 &1 &0 \\
\vdots &\vdots &\vdots &\vdots \\
20 &0 &1 &0 \\
21 &0 &0 &1 \\
\vdots &\vdots &\vdots &\vdots \\
30 &0 &0 &1
\end{array}
In your output, the intercept will be missing but you will see each level represented. In this case, the estimates are the log odds of the outcome in each level directly.
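If you want to see the design matrices R actually builds under the two codings, model.matrix() will display them; here is a minimal sketch using a 3-level factor like the one in the demo below:
x = as.factor(rep(c(1,2,3), each=10))
unique(model.matrix(~ x))       # RLC: (Intercept), x2, x3
unique(model.matrix(~ x + 0))   # LMC: x1, x2, x3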
Another important difference to be aware of is that with RLC, the test (e.g., $p$-value) associated with the intercept is a test of whether that parameter (i.e., the log odds of 'success' in the reference level, x1) is equal to $0$ (i.e., whether the probability is $.5$), while the tests associated with the other levels (x2 and x3 here) are tests of whether each of those levels is equal to the reference level, x1. With LMC, on the other hand, all tests are of whether the parameter in question is equal to $0$ (again, whether the probability equals $.5$).
Here is a quick R demo:
x = as.factor(rep(c(1,2,3), each=10))   # 3-level factor, 10 observations per level
y = c(rep(0,8), 1, 1,                   # level 1: 2 successes out of 10 -> p = .2
      rep(c(0,1), each=5),              # level 2: 5 successes out of 10 -> p = .5
      0,0, rep(1,8) )                   # level 3: 8 successes out of 10 -> p = .8
RLC = glm(y~x, family=binomial)
summary(RLC)
# Call:
# glm(formula = y ~ x, family = binomial)
#
# Deviance Residuals:
# Min 1Q Median 3Q Max
# -1.7941 -0.6681 0.0000 0.6681 1.7941
#
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -1.3863 0.7906 -1.754 0.0795 .
# x2 1.3863 1.0124 1.369 0.1709
# x3 2.7726 1.1180 2.480 0.0131 *
# ...
# Null deviance: 41.589 on 29 degrees of freedom
# Residual deviance: 33.879 on 27 degrees of freedom
# AIC: 39.879
LMC = glm(y~x+0, family=binomial)
summary(LMC)
# Call:
# glm(formula = y ~ x + 0, family = binomial)
#
# Deviance Residuals:
# Min 1Q Median 3Q Max
# -1.7941 -0.6681 0.0000 0.6681 1.7941
#
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# x1 -1.386e+00 7.906e-01 -1.754 0.0795 .
# x2 -3.511e-17 6.325e-01 0.000 1.0000
# x3 1.386e+00 7.906e-01 1.754 0.0795 .
# ...
# Null deviance: 41.589 on 30 degrees of freedom
# Residual deviance: 33.879 on 27 degrees of freedom
# AIC: 39.879
Notice that the residuals, the deviances, the degrees of freedom, and the AIC are identical between the two models. Importantly, notice that the Estimate and Pr(>|z|) for (Intercept) and x1 are identical in RLC and LMC. Further, notice that the Estimates for x2 and x3 in RLC are equal to the differences in the estimates in LMC: viz., $0 - (-1.3863) = 1.3863$, and $1.3863 - (-1.3863) = 2.7726$. Lastly, notice that the Pr(>|z|)s differ for x2 and x3 in RLC and LMC, as they are testing different null hypotheses.
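If you want to confirm these relationships numerically, something along these lines should work with the RLC and LMC fits from the demo above:
coef(LMC)                                    # level means: log odds within each level
qlogis(c(.2, .5, .8))                        # logits of the observed proportions
coef(RLC)["(Intercept)"]                     # equals coef(LMC)["x1"]
coef(RLC)[c("x2", "x3")]                     # differences from the reference level...
coef(LMC)[c("x2", "x3")] - coef(LMC)["x1"]   # ...recovered from the level means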
1. See, e.g., here and here.
2. For more on probabilities and odds, see my answer here.
3. There are many more, see here for some.
4. I cover some material related to this here.