With the standard encoding, $0$ and $1$ act as "off" and "on": a $1$ turns on the effect of being in the corresponding category, relative to (above, below, or perhaps equal to) the baseline category that is absorbed into the intercept.
With your encoding, that interpretation is lost. For instance, what is the difference between $001$ and $101$ in your coding scheme? Further, what does it mean when $x_1 = 1$?
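To see the standard interpretation concretely, here is a minimal sketch (my addition, not code from the question) of the treatment coding that R builds internally when you regress on a factor:

# Treatment (dummy) coding: the baseline level "a" is absorbed into the
# intercept, and each remaining column switches exactly one category on.
f <- factor(c("a", "b", "c", "d"))
model.matrix(~ f)

Each coefficient then measures one category's shift from the baseline. Under a binary coding like yours, by contrast, the coefficient on $x_1$ is forced to represent the same shift for every pair of codes that differ only in that bit.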
Arguably (much) worse, the model performance depends on how you do the coding.
set.seed(2022)
N <- 100
categories <- rep(c("a", "b", "c", "d", "e", "f", "g", "h"), rep(N, 8))
y <- seq(1, length(categories), 1)  # deterministic response for the first run
x1_1 <- x2_1 <- x3_1 <- x1_2 <- x2_2 <- x3_2 <- rep(NA, length(y))
# Coding 1 (x*_1): a = 001, b = 010, c = 011, d = 100, e = 101, f = 110, g = 111, h = 000
# Coding 2 (x*_2): a = 000, b = 001, c = 010, d = 011, e = 100, f = 101, g = 110, h = 111
for (i in 1:length(y)){
  if (categories[i] == "a"){
    x1_1[i] <- 0
    x2_1[i] <- 0
    x3_1[i] <- 1
    x1_2[i] <- 0
    x2_2[i] <- 0
    x3_2[i] <- 0
  }
  if (categories[i] == "b"){
    x1_1[i] <- 0
    x2_1[i] <- 1
    x3_1[i] <- 0
    x1_2[i] <- 0
    x2_2[i] <- 0
    x3_2[i] <- 1
  }
  if (categories[i] == "c"){
    x1_1[i] <- 0
    x2_1[i] <- 1
    x3_1[i] <- 1
    x1_2[i] <- 0
    x2_2[i] <- 1
    x3_2[i] <- 0
  }
  if (categories[i] == "d"){
    x1_1[i] <- 1
    x2_1[i] <- 0
    x3_1[i] <- 0
    x1_2[i] <- 0
    x2_2[i] <- 1
    x3_2[i] <- 1
  }
  if (categories[i] == "e"){
    x1_1[i] <- 1
    x2_1[i] <- 0
    x3_1[i] <- 1
    x1_2[i] <- 1
    x2_2[i] <- 0
    x3_2[i] <- 0
  }
  if (categories[i] == "f"){
    x1_1[i] <- 1
    x2_1[i] <- 1
    x3_1[i] <- 0
    x1_2[i] <- 1
    x2_2[i] <- 0
    x3_2[i] <- 1
  }
  if (categories[i] == "g"){
    x1_1[i] <- 1
    x2_1[i] <- 1
    x3_1[i] <- 1
    x1_2[i] <- 1
    x2_2[i] <- 1
    x3_2[i] <- 0
  }
  if (categories[i] == "h"){
    x1_1[i] <- 0
    x2_1[i] <- 0
    x3_1[i] <- 0
    x1_2[i] <- 1
    x2_2[i] <- 1
    x3_2[i] <- 1
  }
}
L1 <- lm(y ~ x1_1 + x2_1 + x3_1) # R^2 = 0.2344, p < 2.2e-16
L2 <- lm(y ~ x1_2 + x2_2 + x3_2) # R^2 = 0.9844, p < 2.2e-16
L3 <- lm(y ~ categories) # R^2 = 0.9844, p < 2.2e-16
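The $R^2$ values in the comments are read off the summary() output; a quick way to pull them out programmatically (not in the original code) is:

summary(L1)$r.squared  # 0.2344
summary(L2)$r.squared  # 0.9844
summary(L3)$r.squared  # 0.9844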
By coincidence, the coding I did as an alternative to the coding in the OP happened to achieve the same $R^2$ as the standard way of encoding the categorical variable. However, I had no way of knowing that when I chose the alternative coding.
It is worth mentioning that, when I change the y variable, that coincidence vanishes. Additionally, the p-values for the overall $F$-test now differ among all three models. (This was probably true in the first simulation as well, but the p-values were so small that R printed them all as p < 2.2e-16.)
set.seed(2022)
# Same categories and the same two codings as above; only the response
# changes, from a deterministic sequence to pure noise.
y <- rnorm(length(categories))
L1 <- lm(y ~ x1_1 + x2_1 + x3_1) # R^2 = 0.005842, p = 0.1979
L2 <- lm(y ~ x1_2 + x2_2 + x3_2) # R^2 = 0.001438, p = 0.7659
L3 <- lm(y ~ categories) # R^2 = 0.0123, p = 0.1981
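summary() reports any p-value below machine precision as p < 2.2e-16, which is why the first simulation's p-values looked identical. To compare the exact overall $F$-test p-values, one option (a sketch of mine, not part of the original code) is to compute them from the stored F statistic:

f1 <- summary(L1)$fstatistic                  # c(value, numdf, dendf)
pf(f1[1], f1[2], f1[3], lower.tail = FALSE)   # exact p-value for the overall F-test

(For the first simulation the statistics are so extreme that even these computed values may underflow to zero.)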