Should discretized continuous variables be treated as numeric or ordinal (in a GLM)?

Question

I am uncertain about how to treat a discretized / binned continuous variable in the glm() function in R. I see two possible ways of feeding it to the glm: either I input the binned variable as it is, or I create a numeric representation of it using as.integer(). Which of these methods would you consider "standard"?

This is what I have tried: First, the continuous variable stored in my data is the age of an individual. Right now I have binned this continuous variable into the following levels: (16-21 22-27 28-33 34-39 40-45 46-51 52-57 58-63 64-69 70+). Assume that the binned variable is called ageBinned.

Now I am uncertain about how to feed this grouped variable to the glm() function after binning it. Right now I have ordered the groups using factor() and relevel(). When I fit the GLM based on this covariate I am uncertain about how to interpret the result.

Model Fit using ageBinned

poisson.glm <- glm(NoClaims ~  ageBinned, family = poisson(link=log), 
                   data=data, offset=log(Duration))

I get the following output:

Coefficients:
   (Intercept)  ageBinned22-27  ageBinned28-33  ageBinned34-39  ageBinned40-45  
      -2.23763         0.43223         0.43151         0.37040         0.31978  
ageBinned46-51  ageBinned52-57  ageBinned58-63  ageBinned64-69    ageBinned70+  
      -0.21415        -0.80053        -0.08639        -0.27468        -0.74130

Model Fit using as.integer(ageBinned):
If I instead treat the binned group as numeric using as.integer(ageBinned), I get the following result:

 (Intercept) as.integer(ageBinned) 
 -1.80403065           -0.03616828

Questions:

When I look at the second output, when I use as.integer(ageBinned), I interpret "Intercept" as $\beta_0$ and the second output parameter as $\beta_{age \; group}$. However, I do not know how to interpret the output from the first glm() where I have used ageBinned.
What method would you consider "standard" out of these methods?
How do the values from ageBinned relate to the regression parameter $\beta_{\rm age \; group}$? Is there still a single common $\beta_{\rm age \; group}$? Is the relationship between the covariates and the regression parameter still in the following form?

\begin{equation} \log(\mu_i) = \beta_0 + \beta_{\rm age \; group}\cdot x \end{equation}

UPDATE

It appears as though making ageBinned into an ordinal categorical variable is the best alternative for me. However, I am not entirely sure exactly how to achieve this. I attempted to order the ageBinned variable through the following command

data$ageBinned = factor(data$ageBinned ,
                              ordered = TRUE,
                              levels = c("16-21", "22-27", "28-33", "34-39",
                                         "40-45", "46-51", "52-57", "58-63", "64-69", "70+"))

By putting these into the glm() function, I then receive the following parameters

                              Estimate Std. Error z value Pr(>|z|)    
(Intercept)                    -2.2939     0.1425 -16.095   <2e-16 ***
claim.data$age.group.factor.L  -1.0050     0.5865  -1.713   0.0866 .  
claim.data$age.group.factor.Q  -0.3142     0.5650  -0.556   0.5781    
claim.data$age.group.factor.C   0.4275     0.5231   0.817   0.4138    
claim.data$age.group.factor^4  -0.4126     0.4821  -0.856   0.3921    
claim.data$age.group.factor^5  -0.3993     0.4590  -0.870   0.3843    
claim.data$age.group.factor^6  -0.1530     0.3979  -0.385   0.7005    
claim.data$age.group.factor^7   0.3577     0.3413   1.048   0.2946    
claim.data$age.group.factor^8   0.3474     0.3202   1.085   0.2779    
claim.data$age.group.factor^9   0.0819     0.2663   0.308   0.7584

Questions

Is this the correct way of ordering the variables?
If so how does this output relate to the regression parameter $\beta_{age}$?
If I want to compute the log-likelihood of this model without using an R package, then I need to be able to compute \begin{equation} \log(\mu_i) = \beta_0 + \beta_{\rm age \; group}\cdot x \end{equation} How do I achieve this with the ordered categorical variable (what would I put in for $x$)?
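
For reference, a sketch of the quantity being asked about, assuming the Poisson model with log link and the log(Duration) offset from the glm() call above. Here $y_i$ stands for NoClaims and $x_i$ for the row of the design matrix belonging to policy $i$ (both symbols are introduced here for illustration); for a factor, $x_i$ is a vector of contrast values rather than a single number:

\begin{equation} \log(\mu_i) = \log(\mathrm{Duration}_i) + x_i^\top \beta, \qquad \ell(\beta) = \sum_i \left[ y_i \log(\mu_i) - \mu_i - \log(y_i!) \right] \end{equation}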

Please say more about what you are trying to accomplish with binning, as it is generally not a good idea. A nonlinear--even non-monotonic--relationship with outcome can be modeled with splines. Then you can illustrate the results for particular individual ages. — EdM, Jun 23 '20 at 23:49
I want to bin it because I want to use the result from this model as a prior and the current model is binned — MarG, Jun 24 '20 at 08:54
To be clear: using age as an ordered, binned category is not "the best alternative for" you. The best alternative is to model age as continuous. You can then average over the predictions if needed. — gung - Reinstate Monica, Jun 25 '20 at 00:48
Your ordering of the categorical predictor seems OK, but binning is still not a good idea. There is no simple relationship between the single slope coefficient you get with a linear fit and the multiple regression coefficients you get from the high-order polynomial fit. Looks like none of the coefficients in that fit are significant, anyway. With a continuous spline you could test for deviations from linearity and maybe even find that there is no association of age with claims at all, simplifying your problem. Don't know how to do the log-likelihood; you could check the open-source R code. — EdM, Jun 25 '20 at 12:51
@EdM But am I still even using a GLM if I start using this spline approach? — MarG, Jun 25 '20 at 13:05
Yes. Think of a spline as a transformation of the predictor, like using the logarithms instead of original values of a positive variable. The right side of your glm() formula just provides what's called the linear predictor. In a standard linear regression, the linear predictor is itself the prediction. In a Poisson GLM the linear predictor models the logarithm of the expected number of counts (the log is the link function) and the variance is modeled as equal to the mean. Other GLMs have different link functions and variance assumptions, but the structure of the linear predictor is the same for all. — EdM, Jun 25 '20 at 13:47

EdM · Answer 1 · 2020-06-24T16:46:39.523

5

Even though it looks like you still only have one predictor when you write the model

glm(NoClaims ~  ageBinned)

what you've actually done by binning is to define a whole new set of predictors, with one predictor for every bin beyond the first. In your case that is 9 predictors. (The 16-21 group is the reference.)

It's possible to specify that the bins represent progressive levels of an ordinal predictor, but you haven't done that. Thus your model will treat each age bin separately despite the natural ordering by age.

So there is no longer a single $\beta_{age}$. In your model the intercept is the value for the reference age bin (16-21) and (with the usual default "treatment contrasts" coding of a categorical predictor) each of the 9 coefficients represents the difference of a bin from the reference bin.
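
As a sketch of what that coding looks like, model.matrix() shows the indicator columns R builds for an unordered factor. The toy data frame is made up for illustration; only the bin labels and the coefficient values are taken from the question.

```r
# Hypothetical toy data: one row per bin, using the bin labels from the question.
bins <- c("16-21", "22-27", "28-33", "34-39", "40-45",
          "46-51", "52-57", "58-63", "64-69", "70+")
toy <- data.frame(ageBinned = factor(bins, levels = bins))

# Default treatment contrasts: an intercept column plus one 0/1 indicator
# column for every bin except the reference (16-21).
X <- model.matrix(~ ageBinned, data = toy)
colnames(X)

# Coefficients as printed in the question (intercept first).
beta <- c(-2.23763, 0.43223, 0.43151, 0.37040, 0.31978,
          -0.21415, -0.80053, -0.08639, -0.27468, -0.74130)

# Linear predictor per bin: intercept + that bin's own coefficient.
eta <- as.vector(X %*% beta)
names(eta) <- bins
exp(eta)   # implied expected claims per unit Duration in each bin
```

Each bin therefore has its own "x": a row of zeros and ones that picks out the intercept and, for non-reference bins, exactly one of the 9 coefficients.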

You've added 8 predictors beyond what you would have in the simple model with age as a continuous linear predictor, set arbitrary cutoffs that make predictions for a 57 year old markedly different from those for a 58 year old, and thrown away the information provided by the natural ordering of ages. Those are among the reasons that binning is not a good idea.

If you use as.integer(age.group) as the predictor you are assuming that the difference between each pair of successive age groups is the same. Because as.integer() codes the bins 1 through 10, the intercept is the extrapolated value at code 0, the 16-21 group sits at the intercept plus one slope, and the slope is the change for each additional bin beyond that. With evenly spaced groups like yours that's assuming a linear relationship with age (except for the highest 70+ group). That doesn't really win you anything over a model using age itself as a linear predictor. You still throw away the possibility of a non-linear contribution of age to outcome.
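
Written out, with $k_i \in \{1, \dots, 10\}$ denoting the integer code of the bin for policy $i$ (a symbol introduced here; 16-21 is coded 1) and plugging in the estimates as printed in the question, the integer-coded model is roughly:

\begin{equation} \log(\mu_i) = \log(\mathrm{Duration}_i) + \beta_0 + \beta_{\rm age \; group}\cdot k_i \approx \log(\mathrm{Duration}_i) - 1.804 - 0.036\, k_i \end{equation}

Under those estimates, each step up to the next bin multiplies the expected claim rate by about $\exp(-0.036) \approx 0.96$, whatever the bin.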

Treating your groups as ordinal predictors would better respect the natural ordering, as the default coding in R would be "polynomial contrasts". The resulting coefficients aren't easily interpreted in terms of the original bins, but predictions for any particular age can be obtained with the predict() function. You still, however, will have 9 coefficients to estimate beyond the intercept.
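
A minimal sketch of what the polynomial coding does, and of what the "x" in the linear predictor becomes for an ordered factor. The coefficient vector is made up for illustration; the bin labels are from the question.

```r
bins <- c("16-21", "22-27", "28-33", "34-39", "40-45",
          "46-51", "52-57", "58-63", "64-69", "70+")
ageOrd <- factor(bins, levels = bins, ordered = TRUE)

# Ordered factors get orthogonal polynomial contrasts by default:
# a 10 x 9 matrix with columns .L, .Q, .C, ^4, ..., ^9.
P <- contr.poly(length(bins))
rownames(P) <- bins
round(P[, 1:3], 3)

# The design-matrix row for a bin is c(1, P[bin, ]). That row is the "x"
# that multiplies the intercept and the 9 contrast coefficients.
X <- model.matrix(~ ageOrd)

# Hypothetical coefficients (intercept, then .L ... ^9), for illustration only.
beta <- c(-2.3, -1.0, -0.3, 0.4, -0.4, -0.4, -0.2, 0.4, 0.3, 0.1)
eta <- as.vector(X %*% beta)   # linear predictor, one value per bin
names(eta) <- bins
eta
```

Because every contrast column sums to zero, the intercept in this coding is the unweighted mean of the linear predictor across the 10 bins rather than the value for a reference bin.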

If you need estimates at particular ages or age ranges for a downstream application you are much better off doing a continuous regression model, with restricted cubic splines of age as the predictor. You will probably only need to add 1 to 3 extra predictors via the spline model to get a reasonable fit beyond the linear model for age, versus the 8 extra with your bins. That lessens the risk of overfitting, so your model is more likely to generalize well. Then, for the downstream application, extract predictions for the particular example ages or age ranges from the continuous model, using predict(). That, rather than prior binning, would be the "standard" approach to your problem.
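
A sketch of that approach on simulated data, since the original data set is not available. The data-generating rate, the sample size, and the choice of 3 degrees of freedom are all assumptions for illustration; ns() from the splines package builds a natural cubic spline basis, which is one form of restricted cubic spline.

```r
library(splines)  # ships with R

set.seed(1)
# Simulated stand-in for the claims data; column names follow the question.
n <- 500
sim <- data.frame(age = sample(16:80, n, replace = TRUE),
                  Duration = runif(n, 0.5, 3))
true_rate <- exp(-2 + 0.6 * sin((sim$age - 16) / 20))  # arbitrary nonlinear shape
sim$NoClaims <- rpois(n, sim$Duration * true_rate)

# Continuous age via a natural cubic spline with 3 degrees of freedom,
# i.e. 2 predictors more than a straight line in age.
fit <- glm(NoClaims ~ ns(age, df = 3) + offset(log(Duration)),
           family = poisson(link = "log"), data = sim)

# Linear-in-age model for comparison; the likelihood-ratio test checks
# whether the spline terms improve on linearity.
lin <- glm(NoClaims ~ age + offset(log(Duration)),
           family = poisson(link = "log"), data = sim)
anova(lin, fit, test = "Chisq")

# Expected claims per unit Duration at chosen ages.
new <- data.frame(age = c(20, 35, 50, 65), Duration = 1)
pred <- predict(fit, newdata = new, type = "response")
pred
```

Putting the offset inside the formula, rather than in the offset argument, lets predict() pick up Duration from newdata.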

edited Jun 24 '20 at 16:46

answered Jun 24 '20 at 14:14

EdM


Ok, so I get that using the binned categories as they are is not a good idea. What would be the difference between using as.integer(age.group) and specifying that the bins represent progressive levels of an ordinal predictor? Here link they mention that ordinal variables can be treated as numeric if the numerical distance between each set of subsequent categories is equal.
About the negative parameters. I accidentally included the exponential of the parameters for one model.
– MarG Jun 24 '20 at 14:57
Thank you for the response! – MarG Jun 24 '20 at 15:00
@MarG with an ordinal predictor, instead of the "treatment contrasts" default (reference level and differences from the reference), R specifies what are called "polynomial contrasts" for the categories, which better respect the ordering of the levels. Then you use a predict() function for particular cases. You also can penalize differences between coefficients to minimize abrupt differences from bin to bin. But why not just model continuously first, then predict from that? – EdM Jun 24 '20 at 15:56
It is a bit hard to briefly explain but, I want to bin the continuous covariate into intervals because I will use this output in another model where I need to estimate an expectation of the number of claims over each individual age interval. However, I have a small dataset over a short time period. So, if I do this for each unit age increase (16,17,...) then I get a lot of ages where there are no recorded claims. Or for a particular age group like 66-year-olds I have one recorded policyholder who has made many claims. So if I bin the data to me it seems like the method becomes more regularized – MarG Jun 24 '20 at 16:33
Also, when I add more variables into the mix there will be a lot of combinations of covariates where there are no claims at all. I also have another covariate which has values in the range: (29575, 2739284). If I don't discretize this covariate in particular, the claim expectation for most combinations of covariates would be zero. – MarG Jun 24 '20 at 16:46
A comment unrelated to my two comments above: In the literature, they mention that data points (policies) with the same combination of covariates should be homogenous, meaning that they have the same probability distribution. So, does it really make sense to not group for instance 27 and 28-year-olds? I don't think that their tendency to make claims will differ based on their age? – MarG Jun 24 '20 at 16:51
@MarG with a continuous model for age it wouldn't matter whether your data set lacks particular combinations of covariates. You can still extract predictions for any desired age or range of ages, as the model will interpolate appropriately. That's even more important when additional covariates are considered. The problems posed by an outlier like that one policyholder should be handled better by a continuous model, as the effect will then be spread out over the entire data set rather than restricted to a single age bin. Apparent outliers need careful consideration in any event. – EdM Jun 24 '20 at 16:56
@MarG homogeneity at any combination of covariate values is an assumption of all these models. Age is a covariate. All 26-year-olds with a particular set of other covariate values should have the same probabilities of claims, but their probabilities could differ from those of 27-year-olds with the same other covariate values. In that case binning could violate the homogeneity assumption. With a continuous age model you can test directly and efficiently whether "their tendency to make claims will differ based on their age" and modify your model accordingly. – EdM Jun 24 '20 at 17:13
I updated the question above. – MarG Jun 24 '20 at 21:54

score 3 · Answer 2 · answered Jun 24 '20 at 21:06

@EdM has provided a good answer. Binning is not a good idea in general, or here specifically. Let me add a couple complementary points.

I wouldn't trust "the literature" that there isn't a difference between 26 and 27 year olds. It is to be expected that the differences between nearly identical values will be very small. There will be no power to detect those differences. Note that using bins assumes that there is a meaningful difference between 27 and 28 year olds.
In R, if you use as.integer(ageBinned), you convert the ageBinned levels into 1,2,3,...,10 (whereas the binning converted the original values into unrelated bins). This means you are fitting a series of constantly incremented step functions.
If you make the categorical ageBinned variable into an ordinal categorical variable, you will use the same number of degrees of freedom, they'll just be decomposed into linear and increasingly complex curvilinear fits.
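
A small sketch of the last two points, using made-up values with the bin labels from the question: as.integer() returns the position of each value's level, and the unordered and ordered codings produce design matrices with the same number of columns.

```r
bins <- c("16-21", "22-27", "28-33", "34-39", "40-45",
          "46-51", "52-57", "58-63", "64-69", "70+")

# as.integer() on a factor gives the index of each value's level.
ageBinned <- factor(c("28-33", "16-21", "70+", "46-51"), levels = bins)
codes <- as.integer(ageBinned)
codes   # 3 1 10 6

# Number of design-matrix columns (intercept included) under each coding.
unord <- factor(bins, levels = bins)
ord   <- factor(bins, levels = bins, ordered = TRUE)
p_unordered <- ncol(model.matrix(~ unord))              # 10
p_ordered   <- ncol(model.matrix(~ ord))                # 10
p_integer   <- ncol(model.matrix(~ as.integer(unord)))  # 2
c(p_unordered, p_ordered, p_integer)
```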

Your best bet is to refit the subsequent model (or find a better one) so that you can use age as continuous.
Assuming you can't, use age as continuous here, then average over the predicted values from this model ($\hat{y}$s) within each bin to get the values you'll use for the subsequent model.
Note that you'll need to make some assumptions about the distribution of ages within the bins for that. There may be some data (e.g., census) you can use, but it also may not make much difference and you could just use a uniform distribution within each bin.
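
A sketch of that averaging step on simulated data. The data, the linear-in-age model, the uniform weighting of whole ages within each bin, and the cutoff of 75 for the open-ended 70+ bin are all assumptions for illustration; a spline model could be substituted for the linear one.

```r
set.seed(2)
# Simulated stand-in for the claims data; column names follow the question.
n <- 400
sim <- data.frame(age = sample(16:80, n, replace = TRUE),
                  Duration = runif(n, 0.5, 3))
sim$NoClaims <- rpois(n, sim$Duration * exp(-1.5 - 0.01 * sim$age))

# Continuous-age Poisson model with the exposure offset in the formula.
fit <- glm(NoClaims ~ age + offset(log(Duration)),
           family = poisson(link = "log"), data = sim)

# Predicted claims per unit Duration at every whole age.
grid <- data.frame(age = 16:75, Duration = 1)  # 70+ truncated at 75 (assumption)
grid$rate <- predict(fit, newdata = grid, type = "response")

# Assign each age to its bin: [16,22), [22,28), ..., [64,70), [70,Inf).
bins <- c("16-21", "22-27", "28-33", "34-39", "40-45",
          "46-51", "52-57", "58-63", "64-69", "70+")
grid$bin <- cut(grid$age, breaks = c(seq(16, 70, by = 6), Inf),
                right = FALSE, labels = bins)

# Equal weight per whole age = uniform age distribution within each bin.
binMeans <- tapply(grid$rate, grid$bin, mean)
binMeans
```

If census or portfolio age counts are available, replacing mean() with a weighted mean over those counts would drop the uniform assumption.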

Thanks for the reply! What I find a bit confusing is that discretizing seems to be very common in the insurance literature. Yet many here on the forum seem to be against it. — MarG, Jun 24 '20 at 21:36
