Categorical variable disappears in Poisson GLM summary?

Question

For the variable SelfEthnicity there is meant to be 4 levels.

I have made it so there should not be a reference category, but the R output still only shows 3 Ethnicities.

> poissonmodel <-glm(rate~0+SelfEthnicity+Force, family=poisson,data=df_merge)
> dput(head(df_merge, 20))
structure(list(Force = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), levels = c("BTP", 
"Cumbria", "South-Wales"), class = "factor"), SelfEthnicity = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L), levels = c("Asian", "Black", "Other", "White"), class = "factor"), 
    month = c("2017/10/01", "2017/11/01", "2017/12/01", "2018/01/01", 
    "2018/02/01", "2018/04/01", "2018/05/01", "2018/06/01", "2018/07/01", 
    "2018/08/01", "2018/09/01", "2018/10/01", "2018/11/01", "2018/12/01", 
    "2019/01/01", "2019/02/01", "2019/03/01", "2019/04/01", "2019/05/01", 
    "2019/06/01"), number = c(2L, 3L, 1L, 2L, 2L, 1L, 2L, 2L, 
    1L, 5L, 2L, 6L, 2L, 4L, 5L, 5L, 16L, 5L, 3L, 3L), roll_sum = c(2L, 
    5L, 6L, 8L, 10L, 11L, 13L, 15L, 16L, 21L, 23L, 29L, 29L, 
    30L, 34L, 37L, 51L, 55L, 56L, 57L), Resident Population = c(4213531, 
    4213531, 4213531, 4213531, 4213531, 4213531, 4213531, 4213531, 
    4213531, 4213531, 4213531, 4213531, 4213531, 4213531, 4213531, 
    4213531, 4213531, 4213531, 4213531, 4213531), rate = c(0, 
    0.001, 0.001, 0.002, 0.002, 0.003, 0.003, 0.004, 0.004, 0.005, 
    0.005, 0.007, 0.007, 0.007, 0.008, 0.009, 0.012, 0.013, 0.013, 
    0.014)), class = c("grouped_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -20L), groups = structure(list(Force = structure(1L, levels = c("BTP", 
"Cumbria", "South-Wales"), class = "factor"), SelfEthnicity = structure(1L, levels = c("Asian", 
"Black", "Other", "White"), class = "factor"), .rows = structure(list(
    1:20), ptype = integer(0), class = c("vctrs_list_of", "vctrs_vctr", 
"list"))), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-1L), .drop = TRUE))

What exactly do you mean by "made it so there should not be a reference category?" How did you do that? Why did you do that? It seems natural that ethnicity would be a categorical variable and treated as such. There is not enough information here to help. How are the data structured? You can add the data structure by copying and pasting the output of dput(head(your_data, 20)) — jpsmith, Jan 18 '23 at 19:08
Putting in the intercept 0 to show that my reference category was in the force category. — , Jan 18 '23 at 19:17
Can you also add your model code? You may want to look over this SO help center page: How do I ask a good question? — jpsmith, Jan 18 '23 at 19:19
OP has a fundamental misunderstanding of what removing the intercept from the model does. It will only allow all levels if the first factor to be explicitly represented. There remains a linear dependency between the levels of the first factor and those of the others. Suggest migration to corss validated. — Limey, Jan 18 '23 at 19:27
See https://stats.stackexchange.com/questions/285210/what-to-do-in-a-multinomial-logistic-regression-when-all-levels-of-dv-are-of-int/544656#544656 for a way to include all levels of categorical variables in model output! — kjetil b halvorsen, Jan 23 '23 at 14:38

score 2 · Answer 1 · answered Jan 18 '23 at 21:23

Short answer: you don't have $3 \times 4 = 12$ or even $3 + 4 = 7$ independent factors; you only have $3 + (4-1)=6$ explanatory variables in this model. That's because one category in each factor variable after the first can act as the reference (whether or not you conceive of it that way) and every other value in that factor is compared to that reference.

Let's look at what you did.

Your data will be encoded in a "model matrix" (aka "design matrix") $X$ for all calculations, no matter what kind of regression model you are fitting (OLS, GLM, etc.). You can construct and inspect $X$ by means of the model.matrix function in R, as shown in the examples below.

To understand the model matrix, it helps to simplify: reduce your dataset to one record for each unique combination of (factor) variables that occurs. Here's an example (which I named "df"):

(df <- expand.grid(F = c("BTP", "Cumbria", "South-Wales"),
                 S =  c("Asian", "Black", "Other", "White")))

          Force SelfEthnicity
1          BTP         Asian
2      Cumbria         Asian
3  South-Wales         Asian
4          BTP         Black
5      Cumbria         Black
6  South-Wales         Black
7          BTP         Other
8      Cumbria         Other
9  South-Wales         Other
10         BTP         White
11     Cumbria         White
12 South-Wales         White

There are four levels of the SelfEthnicity variable and three levels of the Force variable.

Here is what you will get for $X$ with the formula in your question. I renamed the variables "F" and "S" to abbreviate the column headings and I have concatenated the model matrix with the previous data frame to show how each record is encoded.

    X <- model.matrix(~ 0 + F + S, df)
    (cbind(df, X))

             F     S FBTP FCumbria FSouth-Wales SBlack SOther SWhite
1          BTP Asian    1        0            0      0      0      0
2      Cumbria Asian    0        1            0      0      0      0
3  South-Wales Asian    0        0            1      0      0      0
4          BTP Black    1        0            0      1      0      0
5      Cumbria Black    0        1            0      1      0      0
6  South-Wales Black    0        0            1      1      0      0
7          BTP Other    1        0            0      0      1      0
8      Cumbria Other    0        1            0      0      1      0
9  South-Wales Other    0        0            1      0      1      0
10         BTP White    1        0            0      0      0      1
11     Cumbria White    0        1            0      0      0      1
12 South-Wales White    0        0            1      0      0      1

Notice there are only $6 = 3 + (4-1)$ columns (distinct variables), even though there are $3\times 4 = 12$ distinct combinations of the variable values. Everything you might want to know about the meaning and the interpretation of the model output is contained in this array.

Let's begin with the first line. The vector $(1,0,0,0,0,0)$ represents any individual in the (BTP, Asian) category. Thus, your model's estimate of ForceBTP (here shown as FBTP, corresponding to the first vector coordinate) is the response ("rate") for any individual in this category. A similar interpretation applies to the second and third columns.

The fourth column gets interesting: here is where the second (ethnicity) category is introduced. It shows up only in combination with other columns -- you never see a row where all components but the fourth are zero. For instance, record 4 for (BTP, Black) is coded by the vector $(1,0,0,1,0,0).$ This says the rate estimate for anyone in this category is the sum of the ForceBTP and SelfEthnicityBlack parameters. Reinterpreted, this tells us

SelfEthnicityBlack is the estimated difference in rates between the (BTP, Black) and (BTP, Asian) individuals.

In this sense, Asian is a "base" or "reference" category to which all other values of SelfEthnicity will be compared. You can confirm this by inspecting the fifth and sixth columns of the model matrix, labeled SOther and SWhite: their coefficients represent differences relative to Asian for any given value of Force.

There are plenty of other ways to encode the same model. For instance, look at what happens when you simply reverse the order in which you specify the categorical variables in the formula:

    X <- model.matrix(~ 0 + S + F, df)
    (cbind(df, X))

The result now is

             F     S SAsian SBlack SOther SWhite FCumbria FSouth-Wales
1          BTP Asian      1      0      0      0        0            0
2      Cumbria Asian      1      0      0      0        1            0
3  South-Wales Asian      1      0      0      0        0            1
4          BTP Black      0      1      0      0        0            0
5      Cumbria Black      0      1      0      0        1            0
6  South-Wales Black      0      1      0      0        0            1
7          BTP Other      0      0      1      0        0            0
8      Cumbria Other      0      0      1      0        1            0
9  South-Wales Other      0      0      1      0        0            1
10         BTP White      0      0      0      1        0            0
11     Cumbria White      0      0      0      1        1            0
12 South-Wales White      0      0      0      1        0            1

All the SelfEthnicity categories are explicitly named in the first four columns, while the missing BTP value for Force must now be its reference category. There are again $6 = 4 + (3-1)$ columns. Although the meanings of these columns have changed, the model is the same as before, because any parameter estimates from one can be suitably combined (through addition and subtraction) to yield corresponding estimates from the other. For instance, the rate for (Cumbria, White) (line #11) is the sum of the second and sixth parameters in the first model and it's the sum of the fourth and fifth parameters in the second model.

Generally, if we write the parameters as $\beta^\prime = (\beta_1,\beta_2,\ldots,\beta_k)$ (here, $p=k$), then the estimate for anyone in a row of the design matrix with values $\mathbf x = (x_1, x_2,\ldots, x_k)$ is

$$\mathbf x \beta = x_1\beta_1 + x_2 \beta_2 + \cdots x_p \beta_k.$$

With this "nice" design matrix, all the $x_i$ are zeros and ones, so the zeros disappear and the ones simply pick out the corresponding parameter values. Other design matrices might not be so nice in this regard: inspect X <- model.matrix(~ 0 + F + ordered(S), df) for example. This one has some irrational values. Even so, it represents the same model, just coded differently to simplify certain intended interpretations of parameter estimates.

I invite you to study the model matrices arising from the default "dummy coding" created by the formulas ~ F + S and ~ S + F, which do include an intercept term. In this way you might quickly gain a reliable, intuitive understanding of how dummy coding works, how to interpret regression output, and how to specify exactly the coding you want (in any software platform, not just R).

Hi, many thanks for your response. Does this mean that even though an estimate for Asian doesn't come up, that it is acting as a reference category for SelfEthnicity? @whuber ? — , Jan 19 '23 at 00:14
An estimate for Asian must also involve some reference value for Force. (It doesn't make any sense in isolation, because Force must have some value.) Take your pick. — whuber, Jan 19 '23 at 00:53
Please take a look at the columns in the examples again! They correspond to the output: for each column $j$ the software will output an estimate $\hat\beta_j.$ — whuber, Jan 19 '23 at 13:44
I think I understand now. So essentially you cannot get ceteris parabis with the model when interpreting the reference category? @whuber — , Jan 19 '23 at 16:26
Also, sorry but is the AIC = inf a problem for the model, I am not sure how to interpret this? @whuber — , Jan 20 '23 at 08:34

Categorical variable disappears in Poisson GLM summary?

1 Answers1

Linked