I'm trying to fit a GLM on some data and I feel like there should be an interaction term between two of the explanatory variables (one categorical and one discrete) but all the non-zero instances of the discrete variable occur on the "1" state of the categorical variable (partly why I feel like there should be an interaction). When I put the interaction in the glm (var1*var2), it just shows N/A for the interaction term (var1:var2) in the summary ANOVA.

I have included a mock example below:

# Use `=` (not `<-`) inside data.frame() so the columns are actually named y, var1, var2
a <- data.frame(y = c(0, 1, 2, 3),
                var1 = c(0, 1, 1, 1),
                var2 = c(0, 0, 1, 2))
a.glm <- glm(y ~ var1 * var2, family = poisson, data = a)
summary(a.glm)

and then this shows up in the console:

Call:
glm(formula = y ~ var1 * var2, family = poisson, data = a)

Deviance Residuals:
       1         2         3         4
-0.00002  -0.08284   0.12401  -0.04870

Coefficients: (1 not defined because of singularities)
             Estimate Std. Error z value Pr(>|z|)
(Intercept)   -22.303  42247.166    0.00     1.00
var1           22.384  42247.166    0.00     1.00
var2            0.522      0.534    0.98     0.33
var1:var2          NA         NA      NA       NA

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 4.498681  on 3  degrees of freedom
Residual deviance: 0.024614  on 1  degrees of freedom
AIC: 13.63

Number of Fisher Scoring iterations: 20

This is the table giving the mean of y for each combination in my actual data.

| var1 \ var2 | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 1.592 | N/A | N/A | N/A |
| 1 | 1.859 | 1.759 | 1.543 | 0.813 |
| mean | 1.721 | 1.759 | 1.543 | 0.813 |

I'd rather not make var2 categorical, as there clearly seems to be a negative correlation between var2 and y that is being overshadowed by the var1 = 0 values. (There are relatively few observations with var2 = 2 or 3, which does not help overcome this effect.)

Any help would be appreciated!

Thank you!

Kon-kon
  • You say "all the non-zero instances of the discrete variable occur on the "1" state of the categorical variable" but either this is not reflected in your mock example OR some zero-cases of the discrete variable also show up in the "1" state of your categorical. Which is it? – AdamO Jan 20 '23 at 17:13
  • All the non-zero cases and some zero cases are in the "1" state and only zero cases are in the "0" state. I'm pretty sure the "all the non-zero instances of the discrete variable occur on the "1" state of the categorical variable" and "some zero-cases of the discrete variable also show up in the "1" state of the categorical" are not contradictory. In either case, that's beside the point. My data is similar to the example. – Kon-kon Jan 20 '23 at 17:16
  • In your example, the product of var1 and var2 is perfectly collinear with var2. That seems to be the problem. – Dave Armstrong Jan 20 '23 at 17:52
  • Oh that makes sense. Do you have any idea on how I could model the interaction then? – Kon-kon Jan 20 '23 at 18:02
  • @Kon-kon in your toy example, y is generated according to a very specific process. I think you'll find if you generate more samples, and perhaps some randomness to the response, you will solve this problem. – AdamO Jan 20 '23 at 18:49
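The collinearity pointed out in the comments can be checked directly on the mock data. This is only a small sketch: the names `inter` and `X` are introduced here for illustration, and the data frame is rebuilt with `=` so that it is self-contained.

```r
# Mock data from the question
a <- data.frame(y = c(0, 1, 2, 3),
                var1 = c(0, 1, 1, 1),
                var2 = c(0, 0, 1, 2))

# The interaction column is the element-wise product of the two predictors
inter <- a$var1 * a$var2

# The product equals var2 in every row, so it adds no new information
identical(inter, a$var2)

# The design matrix has four columns (intercept, var1, var2, var1:var2)
# but its rank is lower, which is why glm() reports one coefficient as NA
X <- model.matrix(~ var1 * var2, data = a)
ncol(X)
qr(X)$rank
```

Because var2 is non-zero only where var1 = 1, the same thing happens with any data of this structure, not just the toy example.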

1 Answer

One way to deal with this is to set up a new categorical predictor based on the combinations of predictor values for which you have data. A full interaction between a binary predictor and a 3-level categorical predictor would require fitting 5 coefficients beyond the intercept (one parameter for each of the 6 cells), but you only have 4 combinations with values. So you could define a new 4-level categorical predictor $x_{ij}$ with $i$ being the level of var1 and $j$ being the corresponding level of var2. You would define $x_{00}$, $x_{10}$, $x_{11}$ and $x_{12}$ in your example. Then use $x_{ij}$ as the predictor in your model instead of var1 and var2.

That allows you to evaluate the overall association of the predictors with outcome and differences among particular combinations of var1 and var2. That's not the same as a standard interaction term, but it's something you can do with the data you have.
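A minimal sketch of the combined-predictor idea, assuming base R's `interaction()` is an acceptable way to build $x_{ij}$. The data here are simulated purely for illustration (the sample size, seed, and the coefficients inside `exp()` are made up, not taken from the question); only the structure, with var2 forced to 0 whenever var1 = 0, mirrors the question.

```r
# Hypothetical simulated data, NOT the asker's real data
set.seed(1)
n <- 200
var1 <- rbinom(n, 1, 0.6)
# var2 can be 0..3 when var1 = 1, and is always 0 when var1 = 0
var2 <- ifelse(var1 == 1, sample(0:3, n, replace = TRUE), 0)
y <- rpois(n, lambda = exp(0.5 + 0.2 * var1 - 0.25 * var2))
d <- data.frame(y, var1, var2)

# One factor level per observed (var1, var2) combination;
# drop = TRUE discards the combinations that never occur
d$x <- interaction(d$var1, d$var2, drop = TRUE)
levels(d$x)

# Use the combined factor in place of var1 and var2
fit_cat <- glm(y ~ x, family = poisson, data = d)
summary(fit_cat)
```

Each non-intercept coefficient is then the log rate ratio of that combination relative to the reference level (var1 = 0, var2 = 0).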

If you want to treat var2 as a linear predictor (with only a few levels), and if the nature of the subject matter is such that you can have non-zero values for var2 only when var1=1 (with var2=0 still possible when var1=1), then you can use the approach on this page. Write your model without the interaction term:

glm(formula = y ~ var1 + var2, family = poisson, data = a)

The Intercept of the model is the estimated outcome (on the log-link scale) for var1=0 (for all of which var2=0 also). The coefficient for var1 is the difference in estimated outcome between var1=1 and var1=0 at var2=0. The coefficient for var2 then evaluates your hypothesis that there is a linear relationship (on the log scale) between var2 and outcome, using the var1=1 cases.
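To make the interpretation concrete, here is a hedged sketch of the additive fit on simulated data. As before, the data-generating values and the names `d`, `fit_lin`, `rates`, and `newd` are assumptions for illustration only, not the asker's data or results.

```r
# Hypothetical simulated data with the same structure as the question
set.seed(1)
n <- 200
var1 <- rbinom(n, 1, 0.6)
var2 <- ifelse(var1 == 1, sample(0:3, n, replace = TRUE), 0)
y <- rpois(n, lambda = exp(0.5 + 0.2 * var1 - 0.25 * var2))
d <- data.frame(y, var1, var2)

# Additive model: no interaction term, so all coefficients are estimable
fit_lin <- glm(y ~ var1 + var2, family = poisson, data = d)

# exp() moves the log-scale coefficients to the rate scale
rates <- exp(coef(fit_lin))
rates["(Intercept)"]  # estimated mean of y when var1 = 0 (var2 is then 0)
rates["var1"]         # rate ratio, var1 = 1 versus var1 = 0, at var2 = 0
rates["var2"]         # multiplicative change in mean y per unit of var2

# Predicted means for the combinations that can actually occur
newd <- data.frame(var1 = c(0, 1, 1, 1, 1), var2 = c(0, 0, 1, 2, 3))
newd$pred <- predict(fit_lin, newdata = newd, type = "response")
newd
```

The prediction for the first row of `newd` equals `rates["(Intercept)"]`, and each later row multiplies in the var1 and var2 factors.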

If there could theoretically be non-zero values of var2 when var1=0, you can still model var2 continuously as above, but then you have to be very careful when interpreting your results. You only have information about the association of var2 with outcome when var1=1. You can't say anything about the association of var2 with outcome when var1=0, and you thus have no information about a possible interaction between var1 and var2.

EdM
  • Thanks a lot, I'll try that if there is no better way. I'm pretty sure there is a relation between y and var2, though, which gets ignored when the 0 values for var2 are pooled together, so I'd rather not make var2 categorical. – Kon-kon Jan 20 '23 at 18:26
  • @Kon-kon I've added a suggestion for how to treat var2 as a linear predictor. You don't get a formal "interaction" evaluation, but you can evaluate its association with outcome for the var1=1 situation, as well as the association of var1=1 with outcome in the var2=0 situation. – EdM Jan 20 '23 at 19:05
  • Thank you! My data is about pandas and var1 is if they are "paired" and var2 is how many offspring they have, so none of the unpaired pandas have offspring but the issue is that some of the paired ones do, so sadly I don't think your answer works. It was quite insightful though. – Kon-kon Jan 20 '23 at 21:30
  • @Kon-kon the situation you describe is what my later suggestion covers. If I understand correctly, unpaired pandas (var1=0) by necessity have 0 offspring. Paired pandas might have 0, 1, 2 or more. Your var1 is like the "loan indicator" on the page I linked, and var2 is like the "loan amount." It's not a problem that some paired pandas also have 0 offspring; the coefficient for var1 is the difference between paired and unpaired at 0 offspring. The coefficient for var2 is the change in outcome beyond that for each additional offspring. – EdM Jan 20 '23 at 21:59
  • Thank you, I'll give it another read. I thought you mentioned that var1=1 needed to have only non-zero offspring. – Kon-kon Jan 20 '23 at 22:06
  • @Kon-kon I see where the confusion might have arisen. I'll reword. – EdM Jan 20 '23 at 22:18