
I'm wondering what the value is in taking a continuous predictor variable and breaking it up (e.g., into quintiles), before using it in a model.

It seems to me that by binning the variable we lose information.

  • Is this just so we can model non-linear effects?
  • If we kept the variable continuous and it wasn't really a straight linear relationship would we need to come up with some kind of curve to best fit the data?
Tom
  • 15
    1) No. You are right that binning loses information, and it should be avoided if possible. 2) Generally, the functional form that is consistent with the theory behind the data is preferred. – O_Devinyak Aug 31 '13 at 06:27
  • 11
    I don't know about benefits, but there are a number of widely-recognized dangers – Glen_b Aug 31 '13 at 08:28
  • 3
    A reluctant argument for it, on occasion: it can simplify clinical interpretation and the presentation of results. E.g., blood pressure is often a quadratic predictor, and a clinician can support the use of cutoffs for low, normal and high BP and may be interested in comparing these broad groups. – user20650 Aug 31 '13 at 16:30
  • 4
    @user20650: I'm not quite sure I understood you, but wouldn't it be better to fit the best model you can, & then use that model's predictions to say anything you want to say about broad groups? The 'high blood-pressure group' in my study won't necessarily have the same distribution of pressures as the general population, so their results won't generalize. – Scortchi - Reinstate Monica Sep 02 '13 at 18:25
  • @scortchi: I agree it is often not useful to use the actual data to decide the cut-offs, but in regard to my example, there are generally agreed pressures (give or take) that clinically indicate hypertension, hypotension and the normal range. How far over or under these thresholds the value falls may not be as important to the clinician as the fact that they have been met. I also agree that not categorising is preferred, but if the aim is simply to present associations with an outcome, it is (in my opinion) sometimes difficult to present non-linear associations in a clear, easily interpretable manner. – user20650 Sep 02 '13 at 23:09
  • 1
    @user20650: Like any presentation, it depends on the audience. From just graphs of predictors vs fitted responses for clients that only want a model overview/sense-check; up to details of restrictions, number & placement of knots for the statistically sophisticated. If there are important reference values for predictors or responses, as there often are, I discuss the model's behaviour with respect to them, display them on the graphs, & sometimes do calculations based on their population distribution & on the model fits. – Scortchi - Reinstate Monica Sep 03 '13 at 08:29
  • 2
    @user20650: Anyway, explaining necessarily complicated things as best you can goes with the job. I wouldn't expect a doctor to put me in for surgery rather than give me medicine just because it's easier for him to explain cutting out a part of my body than to explain how the drug works. – Scortchi - Reinstate Monica Sep 03 '13 at 08:52
  • 8
    The simplified clinical interpretation is a mirage. Effect estimates from categorized continuous variables have no known interpretation. – Frank Harrell Jun 30 '15 at 23:04
  • Also see https://stats.stackexchange.com/questions/104402/what-is-the-justification-for-unsupervised-discretization-of-continuous-variable – kjetil b halvorsen Dec 19 '18 at 12:22
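
To make the "binning loses information" point from the thread concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration, not something taken from the question: the data are simulated, the true effect is taken to be U-shaped (loosely echoing the blood pressure example), and `compare_binned_vs_continuous` is a hypothetical helper name. It compares the in-sample R² of a quintile-binned predictor against a quadratic fit on the continuous predictor.

```python
import numpy as np


def compare_binned_vs_continuous(n=2000, seed=0):
    """Return (R^2 of quintile-binned fit, R^2 of quadratic fit).

    Hypothetical helper: simulated data with an assumed U-shaped effect.
    """
    rng = np.random.default_rng(seed)
    x = rng.uniform(-2.0, 2.0, size=n)
    y = x**2 + rng.normal(0.0, 0.5, size=n)  # assumed true relationship

    # Quintile binning: each observation is predicted by its bin's mean,
    # which is what regressing y on five bin indicators would give.
    edges = np.quantile(x, [0.2, 0.4, 0.6, 0.8])
    bins = np.digitize(x, edges)  # integer labels 0..4
    bin_means = np.array([y[bins == k].mean() for k in range(5)])
    pred_binned = bin_means[bins]

    # Continuous alternative: least-squares quadratic in x.
    coefs = np.polyfit(x, y, deg=2)
    pred_cont = np.polyval(coefs, x)

    def r_squared(pred):
        ss_res = np.sum((y - pred) ** 2)
        ss_tot = np.sum((y - y.mean()) ** 2)
        return 1.0 - ss_res / ss_tot

    return r_squared(pred_binned), r_squared(pred_cont)


if __name__ == "__main__":
    r2_binned, r2_cont = compare_binned_vs_continuous()
    print(f"quintile bins: R^2 = {r2_binned:.3f}")
    print(f"quadratic fit: R^2 = {r2_cont:.3f}")
```

Under these assumptions the binned model spends five parameters (one mean per bin) and still explains less variance than the three-parameter quadratic, because all variation in x within a bin is discarded. With real data the true shape is not known in advance, so a flexible continuous form such as the regression splines alluded to in the comments (knots and their placement) would usually stand in for the quadratic used here.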