
If we have $k$ categories of a categorical variable, why do we need $k-1$ dummy variables? For example, if there are 8 categories, why don't we code them as

0 0 1
0 1 0
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1?

Only 3 dummy variables would be needed as opposed to the commonly used $8-1 = 7$ dummy variables.

Sycorax
Vika
  • This won't work with many ML models which consider variables one by one when building the model (for ex. random forests). – user2974951 Oct 25 '22 at 12:16
  • I suspect my answer to the question Why do we need to dummy code categorical variables? will prove illuminating. Tl;dr: binary coding induces (fictional) numerical relationships between unordered categories which will bias your estimates. – Alexis Oct 25 '22 at 19:01
  • If you labelled the categories $A,B,C,D,E,F,G,H$ with $A$ corresponding to 0 0 0 and then your codes, you would be suggesting that the difference between $D$ and $A$ was in effect a combination of the difference between $B$ and $A$ and the difference between $C$ and $A$. – Henry Oct 26 '22 at 13:15

4 Answers


For one, the predictor variables would not be orthogonal. Generally, linear regressions work better (i.e., have tighter error estimates) when predictors are close to orthogonal, which can be quantified using the condition number of the Gram matrix $X^\top X$.

But the bigger issue with this proposed coding scheme is that it introduces similarity structure to the features that is completely arbitrary. For example, the first row and the second row are less similar than the first row and the third row (as measured by, e.g., dot-product similarity), but in a categorical variable the labeling of the values (or, what is the same thing, the ordering of the rows of the encoding matrix) has no inherent meaning.

Combining these two considerations, the only way that really makes sense to code a categorical variable with $K$ values is to choose $K$ orthogonal vectors, which requires a space of dimension $\geq K$ (or really $\geq K-1$, since you can absorb one of them into the intercept). The typical one-hot coding is a particularly simple way to accomplish this, although in principle you could use any set of $K$ orthogonal vectors.
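To make the orthogonality point concrete, here is a minimal R sketch (my own illustration, not part of the original answer; the object names are arbitrary) comparing the condition number of the Gram matrix under a one-hot coding and under a 3-bit binary coding of eight equally frequent categories:

n_per <- 10
cats  <- factor(rep(letters[1:8], each = n_per))

X_onehot <- model.matrix(~ cats + 0)   # one indicator column per category

idx  <- as.integer(cats) - 1L          # category index 0..7
bits <- cbind(idx %% 2, (idx %/% 2) %% 2, (idx %/% 4) %% 2)  # 3-bit binary code
X_binary <- cbind(1, bits)             # intercept plus the three bit columns

kappa(crossprod(X_onehot))             # orthogonal columns: condition number near 1
kappa(crossprod(X_binary))             # correlated columns: noticeably larger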

Simon Segert

You need a dummy variable for each level so that each level can have its own coefficient, independent of the other levels.

Think about how linear models work. The predicted value is $$ \hat{y}_i = \sum_j \beta_j x_{ji} $$ With dummy variables (aka "one-hot encoding"), for each $i$, only one of the $\beta_j$, the one corresponding to observation $i$'s level of the categorical variable, is added to the sum. Thus, each level gets its own coefficient that defines the outcome's response when the categorical variable takes on that value$^\star$.

Now, consider what happens with your proposed encoding. Levels 1 and 2 each get their own response, but Level 3 is constrained to be the sum of the responses for levels 1 and 2. There is no reason to suppose that this should be the case, so you shouldn't build that constraint into the model. Having a separate dummy variable for each level allows each level to have its own response that does not depend in any way on the other levels.

$^\star$ Well, except for the one special level that gets its response rolled into the intercept term, but you shouldn't do that. If there are $k$ levels, use $k$ dummy variables and no intercept instead of $k-1$ plus an intercept. For example, if you're doing the model in R, use $y\sim x+0$ instead of $y\sim x$.
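As a quick illustration of this footnote (my own sketch, not from the original answer; the data are simulated and the names arbitrary), compare the two parameterizations in R:

set.seed(1)
x  <- factor(rep(c("a", "b", "c"), each = 50))
mu <- c(a = 1, b = 3, c = 5)                  # true level means
y  <- mu[as.character(x)] + rnorm(length(x))

coef(lm(y ~ x + 0))  # xa, xb, xc: one coefficient (a mean) per level
coef(lm(y ~ x))      # (Intercept) = level "a"; xb, xc = differences from "a"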

Nobody
  • Upvoted especially because of the last footnote. I don't understand why the y~x+0 behavior isn't the default in R modeling packages. There is usually no reason why one of the levels should only be defined by an intercept, when all the other levels have an intercept and "slope." That means you think that particular level has less uncertainty in its expected value than all the other ones, which makes no sense in most cases. – qdread Oct 25 '22 at 14:21
  • @qdread Exactly so. And if that first level happens to be particularly poorly observed (and Murphy's Law being what it is, the worst observed level always seems to wind up in that slot), then its uncertainty will infect the standard errors for the coefficients of the other levels. – Nobody Oct 25 '22 at 14:52
  • @qdread it's because the y~x+0 gives coefficients that do not provide meaningful hypothesis tests "on their own" (i.e. by default when using common stats packages). You would need to specify contrasts to be able to make meaningful inference, because rejecting the null associated with a coefficient in the y~x+0 case simply means "the sum of the intercept term and this main effect is nonzero", which is not useful. Whereas in the y~x case, default t-tests give p-values associated directly with the main effect of that variable. – John Madden Oct 26 '22 at 17:44
  • @JohnMadden Good point. Thing is, those hypothesis tests are only worthwhile if there is some special significance to the baseline case. By default, it's just whichever level happened to come first alphabetically, which usually isn't that useful. If there are more than two levels, then most of the time it's better to have confidence intervals on the fixed effect for each level. But, as you say, in that case there is no special significance to whether the coefficient is or is not nonzero (and you also have to remember that it tells you little about the difference between any two coefs.) – Nobody Oct 27 '22 at 14:19
  • @JohnMadden yes that makes sense ... I pretty much always specify contrasts later on and ignore the default t-tests, thus why I'd prefer the x+0 as default. – qdread Oct 27 '22 at 14:29
  • The reason we use $k-1$ dummies vs a baseline, instead of $k$ dummies and no intercept, is that it keeps working when we have several different categorical variables. If variable $A$ has 5 levels, and variable $B$ has 5 levels, and we encode variable $A$ using 5 dummies and no intercept... then we can't really do the same for variable $B$. But if we encode $A$ using 4 dummies for "difference from A's baseline level", and encode $B$ using 4 dummies for "difference from B's baseline level," it works just fine. – civilstat Oct 27 '22 at 19:01
  • @civilstat That's a valid point, but it's easily fixed by adding regularizing priors, though in that case you have to use something better than the built-in glm functions. I often run hierarchical models in Stan with several random effects on categorical variables, and no baseline levels for any of them, and it works just fine. – Nobody Oct 28 '22 at 13:43
  • @Nobody If you only want predictions, not inferences, then I agree you can use regularization to ensure the model can be computed. But what is the interpretation of these regression coefficients? With a single categorical predictor $A$, one-hot encoding using all $k$ dummies, and no intercept, then $\hat\beta_{A=j}$ is the estimated mean response value for a data point which has the $j$th level of predictor variable $A$. But with two categorical predictors $A$ and $B$, and one-hot encoding for both, what is the interpretation of $\hat\beta_{A=j}$? – civilstat Oct 28 '22 at 20:01
  • @civilstat The interpretation with multiple categorical variables is basically the same as it is for just one. If $\beta$ are the coefficients for $A$ and $\gamma$ are the coefficients of $B$, then $\hat{\beta}_j + \hat{\gamma}_k$ is the mean response when $A = j$ and $B = k$. The regularization is needed to cope with the fact that the $\beta$ and $\gamma$ are not separately identifiable. Also, since you normally solve these models with Bayesian Monte Carlo, your MC samples have enough information to reconstruct the posterior distributions of the contrasts, if you want them too. – Nobody Oct 28 '22 at 20:53
  • @Nobody I agree, but $\hat\beta_j + \hat\gamma_k$ is a prediction. How do you interpret just $\hat\beta_j$ on its own? It doesn't have a clear interpretation to me. Please help me understand the "you shouldn't do that" in your answer, and with @qdread's default of y~x+0. It seems you dislike using $k-1$ dummies b/c you have to interpret coefs as contrasts... but isn't that better than having no clear interpretation to the coefs? – civilstat Oct 30 '22 at 14:44
  • You also seem to dislike $k-1$ dummies b/c you'd have to choose a baseline... but isn't that simpler than choosing priors? For the $k$ dummies approach, 0 seems like the wrong center for a prior (whatever the coefs are, they are not contrasts); and choosing a scale for the prior takes further thought. Meanwhile for $k-1$ dummies, just use majority group as baseline (if none is obvious from subject matter). Again, I'm not opposed to using heavy machinery; I just don't see why you think we should default to heavier machinery that requires more hyperparams and also seems less interpretable. – civilstat Oct 30 '22 at 14:49

When you do the standard encoding, you are using $0$ and $1$ as “off” and “on”. When you give a $1$, you turn on the effect of being in the corresponding category over (or under, or maybe neither) the baseline category that is subsumed by the intercept.

With your encoding, that interpretation is lost. For instance, what is the difference between $001$ and $101$ in your coding scheme? Further, what does it mean if $x_1=1?$

Arguably (much) worse, the model performance depends on how you do the coding.

set.seed(2022)
N <- 100
categories <- rep(c("a", "b", "c", "d", "e", "f", "g", "h"), rep(N, 8))
y <- seq(1, length(categories), 1) # rnorm(length(categories))
x1_1 <- x2_1 <- x3_1 <- x1_2 <- x2_2 <- x3_2 <- rep(NA, length(y))

# x*_1: one reading of the coding proposed in the question (a = 001, ..., g = 111, h = 000)
# x*_2: the same 3-bit codes assigned to the categories in a different order (a = 000, ..., h = 111)
for (i in 1:length(y)){

  if (categories[i] == "a"){
    x1_1[i] <- 0; x2_1[i] <- 0; x3_1[i] <- 1
    x1_2[i] <- 0; x2_2[i] <- 0; x3_2[i] <- 0
  }

  if (categories[i] == "b"){
    x1_1[i] <- 0; x2_1[i] <- 1; x3_1[i] <- 0
    x1_2[i] <- 0; x2_2[i] <- 0; x3_2[i] <- 1
  }

  if (categories[i] == "c"){
    x1_1[i] <- 0; x2_1[i] <- 1; x3_1[i] <- 1
    x1_2[i] <- 0; x2_2[i] <- 1; x3_2[i] <- 0
  }

  if (categories[i] == "d"){
    x1_1[i] <- 1; x2_1[i] <- 0; x3_1[i] <- 0
    x1_2[i] <- 0; x2_2[i] <- 1; x3_2[i] <- 1
  }

  if (categories[i] == "e"){
    x1_1[i] <- 1; x2_1[i] <- 0; x3_1[i] <- 1
    x1_2[i] <- 1; x2_2[i] <- 0; x3_2[i] <- 0
  }

  if (categories[i] == "f"){
    x1_1[i] <- 1; x2_1[i] <- 1; x3_1[i] <- 0
    x1_2[i] <- 1; x2_2[i] <- 0; x3_2[i] <- 1
  }

  if (categories[i] == "g"){
    x1_1[i] <- 1; x2_1[i] <- 1; x3_1[i] <- 1
    x1_2[i] <- 1; x2_2[i] <- 1; x3_2[i] <- 0
  }

  if (categories[i] == "h"){
    x1_1[i] <- 0; x2_1[i] <- 0; x3_1[i] <- 0
    x1_2[i] <- 1; x2_2[i] <- 1; x3_2[i] <- 1
  }
}

L1 <- lm(y ~ x1_1 + x2_1 + x3_1) # R^2 = 0.2344, p < 2.2e-16
L2 <- lm(y ~ x1_2 + x2_2 + x3_2) # R^2 = 0.9844, p < 2.2e-16
L3 <- lm(y ~ categories)         # R^2 = 0.9844, p < 2.2e-16

By coincidence, the coding that I did as an alternative to the coding in the OP resulted in the same $R^2$ as the standard way of encoding the categorical variable. However, I had no way of knowing that when I started doing the alternative coding.

It is worth mentioning that, when I change the y variable, that coincidence vanishes. Additionally, the p-values for the overall $F$-test differ across the three models. (This was probably true in the first simulation as well, but the p-values were so small that R cannot distinguish between them.)

set.seed(2022)
N <- 100
categories <- rep(c("a", "b", "c", "d", "e", "f", "g", "h"), rep(N, 8))
y <- rnorm(length(categories))
x1_1 <- x2_1 <- x3_1 <- x1_2 <- x2_2 <- x3_2 <- rep(NA, length(y))

# same two codings as in the previous simulation
for (i in 1:length(y)){

  if (categories[i] == "a"){
    x1_1[i] <- 0; x2_1[i] <- 0; x3_1[i] <- 1
    x1_2[i] <- 0; x2_2[i] <- 0; x3_2[i] <- 0
  }

  if (categories[i] == "b"){
    x1_1[i] <- 0; x2_1[i] <- 1; x3_1[i] <- 0
    x1_2[i] <- 0; x2_2[i] <- 0; x3_2[i] <- 1
  }

  if (categories[i] == "c"){
    x1_1[i] <- 0; x2_1[i] <- 1; x3_1[i] <- 1
    x1_2[i] <- 0; x2_2[i] <- 1; x3_2[i] <- 0
  }

  if (categories[i] == "d"){
    x1_1[i] <- 1; x2_1[i] <- 0; x3_1[i] <- 0
    x1_2[i] <- 0; x2_2[i] <- 1; x3_2[i] <- 1
  }

  if (categories[i] == "e"){
    x1_1[i] <- 1; x2_1[i] <- 0; x3_1[i] <- 1
    x1_2[i] <- 1; x2_2[i] <- 0; x3_2[i] <- 0
  }

  if (categories[i] == "f"){
    x1_1[i] <- 1; x2_1[i] <- 1; x3_1[i] <- 0
    x1_2[i] <- 1; x2_2[i] <- 0; x3_2[i] <- 1
  }

  if (categories[i] == "g"){
    x1_1[i] <- 1; x2_1[i] <- 1; x3_1[i] <- 1
    x1_2[i] <- 1; x2_2[i] <- 1; x3_2[i] <- 0
  }

  if (categories[i] == "h"){
    x1_1[i] <- 0; x2_1[i] <- 0; x3_1[i] <- 0
    x1_2[i] <- 1; x2_2[i] <- 1; x3_2[i] <- 1
  }
}

L1 <- lm(y ~ x1_1 + x2_1 + x3_1) # R^2 = 0.005842, p = 0.1979
L2 <- lm(y ~ x1_2 + x2_2 + x3_2) # R^2 = 0.001438, p = 0.7659
L3 <- lm(y ~ categories)         # R^2 = 0.0123,   p = 0.1981

Dave

The problem is not to encode the different id values of the categories.

if there are 8 categories, why don't we code them as

0 0 1
0 1 0
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1?

Instead, the problem is to encode the different effects of the categories.


Your suggestion is to use only three dummy variables to model the effect as

$$\text{effect of category $=$ $\alpha_1$ (if category $=$ A, C, E, G) + $\alpha_2$ (if category $=$ A, B, E, F) + $\alpha_3$ (if category $=$ A, B, C, D)}$$

In this way you can indeed encode eight category values with a 3-bit binary number.

If we choose $\alpha_1 = 1$, $\alpha_2 = 2$ and $\alpha_3 = 4$, then the eight categories map onto the following id values:

$$\begin{array}{} A & 1\alpha_1 + 1\alpha_2 + 1\alpha_3 = 7 \\ B & 0\alpha_1 + 1\alpha_2 + 1\alpha_3 = 6 \\ C & 1\alpha_1 + 0\alpha_2 + 1\alpha_3 = 5 \\ D & 0\alpha_1 + 0\alpha_2 + 1\alpha_3 = 4 \\ E & 1\alpha_1 + 1\alpha_2 + 0\alpha_3 = 3 \\ F & 0\alpha_1 + 1\alpha_2 + 0\alpha_3 = 2 \\ G & 1\alpha_1 + 0\alpha_2 + 0\alpha_3 = 1 \\ H & 0\alpha_1 + 0\alpha_2 + 0\alpha_3 = 0 \\ \end{array}$$

But we are not interested in modelling the id-values 7,6,5,4,3,2,1,0. Instead we want to model the entire possible space of effects. That space is 8-dimensional and not 3-dimensional.
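A minimal R sketch of this dimension argument (my own addition, not part of the answer): even with an intercept, the 3-bit coding can reach only a 4-dimensional subspace of the 8 possible category effects.

codes <- as.matrix(expand.grid(a1 = 0:1, a2 = 0:1, a3 = 0:1))  # the 8 binary codes
X_bits   <- cbind(1, codes)   # intercept plus 3 dummies: an 8 x 4 design
X_onehot <- diag(8)           # one dummy per category: an 8 x 8 design

qr(X_bits)$rank    # 4: only a 4-dimensional space of effects is representable
qr(X_onehot)$rank  # 8: any vector of 8 category effects is representable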

  • +1 This is an interesting and useful way to phrase the “on/off” I mentioned in my answer. – Dave Oct 26 '22 at 10:29