I am uncertain about how to treat a discretized (binned) continuous variable in the glm() function in R. I see two possible ways of feeding it to glm(): either I pass the binned variable as it is, or I create a numeric representation of it using as.integer(). Which of these would you consider "standard"?
This is what I have tried: First, the continuous variable stored in my data is the age of an individual. Right now I have binned this continuous variable into the following levels: (16-21 22-27 28-33 34-39 40-45 46-51 52-57 58-63 64-69 70+). Assume that the binned variable is called ageBinned.
Now I am uncertain about how to feed this grouped variable to the glm() function after binning it. Right now I have ordered the groups using factor() and relevel(). When I fit the GLM based on this covariate I am uncertain about how to interpret the result.
Model Fit using ageBinned
poisson.glm <- glm(NoClaims ~ ageBinned, family = poisson(link=log),
data=data, offset=log(Duration))
I get the following output:
Coefficients:
(Intercept) ageBinned22-27 ageBinned28-33 ageBinned34-39 ageBinned40-45
-2.23763 0.43223 0.43151 0.37040 0.31978
ageBinned46-51 ageBinned52-57 ageBinned58-63 ageBinned64-69 ageBinned70+
-0.21415 -0.80053 -0.08639 -0.27468 -0.74130
Model Fit using as.integer(ageBinned):
If I instead treat the binned group as numeric using as.integer(ageBinned), I get the following result:
(Intercept) as.integer(ageBinned)
-1.80403065 -0.03616828
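To make the difference between the two fits concrete, here is a minimal sketch of the design matrices R builds in each case. The data frame `toy` and its contents are made up for illustration and are not my real data; only `model.matrix()`, `factor()` and `as.integer()` are standard R.

```r
# Hypothetical toy data: names and values are invented for illustration only.
set.seed(1)
lv <- c("16-21", "22-27", "28-33", "34-39", "40-45",
        "46-51", "52-57", "58-63", "64-69", "70+")
toy <- data.frame(
  ageBinned = factor(sample(lv, 200, replace = TRUE), levels = lv)
)

# Factor coding: an intercept plus one 0/1 indicator column per
# non-reference level (the first level, "16-21", is the reference).
X_factor <- model.matrix(~ ageBinned, data = toy)

# Integer coding: an intercept plus a single column holding the codes 1..10,
# which forces the log-rate to change by the same amount between adjacent bins.
X_integer <- model.matrix(~ as.integer(ageBinned), data = toy)

dim(X_factor)    # one row per observation, 10 columns
dim(X_integer)   # one row per observation, 2 columns
head(X_factor, 3)
```

So the factor fit estimates nine separate offsets from the reference group, while the integer fit estimates a single slope across the bin codes.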
Questions:
- When I look at the second output, where I use as.integer(ageBinned), I interpret "Intercept" as $\beta_0$ and the second output parameter as $\beta_{\rm age \; group}$. However, I do not know how to interpret the output from the first glm(), where I have used ageBinned.
- What method would you consider "standard" out of these methods?
- How do the values from ageBinned relate to the regression parameter $\beta_{\rm age \; group}$? Is there still a single common $\beta_{\rm age \; group}$? Is the relationship between the covariates and the regression parameter still of the following form?
\begin{equation} \log(\mu_i) = \beta_0 + \beta_{\rm age \; group}\cdot x \end{equation}
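For comparison, if the first fit uses R's default treatment (dummy) coding for an unordered factor, which the coefficient names suggest, then I believe the linear predictor would have one coefficient per non-reference level rather than a single slope. The symbols $\beta_j$ and $x_{ij}$ below are my own labels, with $x_{ij} = 1$ if individual $i$ falls in age group $j$ and $0$ otherwise, and the offset $\log(\text{Duration}_i)$ is left out to match the form above:

\begin{equation} \log(\mu_i) = \beta_0 + \sum_{j=2}^{10} \beta_j \cdot x_{ij} \end{equation}

Under that reading, the reference group 16-21 would have $\log(\mu_i) = -2.23763$, and the group 22-27 would have $\log(\mu_i) = -2.23763 + 0.43223 = -1.80540$, both per unit of Duration.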
UPDATE
It appears as though making ageBinned into an ordinal categorical variable is the best alternative for me. However, I am not entirely sure exactly how to achieve this. I attempted to order the ageBinned variable through the following command
data$ageBinned = factor(data$ageBinned ,
ordered = TRUE,
levels = c("16-21", "22-27", "28-33", "34-39",
"40-45", "46-51", "52-57", "58-63", "64-69", "70+"))
By putting these into the glm() function, I then receive the following parameters
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.2939 0.1425 -16.095 <2e-16 ***
claim.data$age.group.factor.L -1.0050 0.5865 -1.713 0.0866 .
claim.data$age.group.factor.Q -0.3142 0.5650 -0.556 0.5781
claim.data$age.group.factor.C 0.4275 0.5231 0.817 0.4138
claim.data$age.group.factor^4 -0.4126 0.4821 -0.856 0.3921
claim.data$age.group.factor^5 -0.3993 0.4590 -0.870 0.3843
claim.data$age.group.factor^6 -0.1530 0.3979 -0.385 0.7005
claim.data$age.group.factor^7 0.3577 0.3413 1.048 0.2946
claim.data$age.group.factor^8 0.3474 0.3202 1.085 0.2779
claim.data$age.group.factor^9 0.0819 0.2663 0.308 0.7584
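The `.L`, `.Q`, `.C`, `^4`, ... labels are the column names of the orthogonal polynomial contrasts that R applies to ordered factors by default (`contr.poly()`). The sketch below, assuming those default contrasts were in effect for the fit above, shows the contrast matrix for ten levels and how the printed estimates would combine into one linear predictor value per age group. The variable names `P`, `b0`, `b` and `eta_by_group` are my own.

```r
lv <- c("16-21", "22-27", "28-33", "34-39", "40-45",
        "46-51", "52-57", "58-63", "64-69", "70+")

# Polynomial contrast matrix for 10 ordered levels: one row per level,
# one column per polynomial degree (linear, quadratic, cubic, ...).
P <- contr.poly(length(lv))
rownames(P) <- lv
colnames(P)               # ".L" ".Q" ".C" "^4" ... "^9"
round(crossprod(P), 10)   # identity matrix: the columns are orthonormal

# Estimates copied from the output above (intercept, then .L through ^9).
b0 <- -2.2939
b  <- c(-1.0050, -0.3142, 0.4275, -0.4126, -0.3993,
        -0.1530, 0.3577, 0.3474, 0.0819)

# Linear predictor (before adding the offset) for each age group:
# intercept plus that group's row of P times the coefficient vector.
eta_by_group <- b0 + drop(P %*% b)
eta_by_group
```

In other words, an individual's "x" is no longer a single number but the row of `P` that belongs to their age group.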
Questions
- Is this the correct way of ordering the variables?
- If so how does this output relate to the regression parameter $\beta_{age}$?
- If I want to compute the log-likelihood of this model without using an R package, then I need to be able to compute \begin{equation} \log(\mu_i) = \beta_0 + \beta_{\rm age \; group}\cdot x \end{equation} How do I achieve this with the ordered categorical variable (what would I put in for x)?
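As a sketch of what a manual computation might look like, the code below fits the same kind of model to simulated data, rebuilds the linear predictor from the design matrix and the offset, and compares the hand-computed Poisson log-likelihood with `logLik()`. The data frame `toy`, the claim rate of 0.3 and the sample size are assumptions made purely for illustration; the column names mirror the ones used above.

```r
# Simulated stand-in data (hypothetical; not the real claims data).
set.seed(42)
lv <- c("16-21", "22-27", "28-33", "34-39", "40-45",
        "46-51", "52-57", "58-63", "64-69", "70+")
n <- 500
toy <- data.frame(
  ageBinned = factor(sample(lv, n, replace = TRUE), levels = lv, ordered = TRUE),
  Duration  = runif(n, 0.5, 2)
)
toy$NoClaims <- rpois(n, lambda = 0.3 * toy$Duration)

fit <- glm(NoClaims ~ ageBinned, family = poisson(link = "log"),
           data = toy, offset = log(Duration))

# Design matrix used by the fit: intercept plus the 9 polynomial contrast columns.
X    <- model.matrix(fit)
beta <- coef(fit)

# Linear predictor: X %*% beta plus the offset log(Duration).
eta <- drop(X %*% beta) + log(toy$Duration)
mu  <- exp(eta)

# Poisson log-likelihood: sum over i of y_i * log(mu_i) - mu_i - log(y_i!).
ll_manual <- sum(toy$NoClaims * eta - mu - lgamma(toy$NoClaims + 1))
ll_glm    <- as.numeric(logLik(fit))

all.equal(ll_manual, ll_glm)
```

The same recipe should work for the unordered factor or the as.integer() version, since only the columns of `X` change.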
The glm() formula just provides what's called the linear predictor. In a standard linear regression, the linear predictor is itself the prediction. In a Poisson GLM, the linear predictor models the logarithm of the expected number of counts (the link function), and the variance is modeled as equal to the mean. Other GLMs have different link functions and variance assumptions, but the structure of the linear predictor is the same for all. – EdM Jun 25 '20 at 13:47