It crossed my mind that when you're designing an experiment and are interested not in NHST but in a full regression model, with coefficients for treatment exposure and relevant covariates, it might be best to discretize all continuous variables by default.

Whether a continuous covariate has a linear relationship with the response variable is usually not known a priori. For example, there was once a lot of interest in classroom size and its effect on grades (as an instrument for the latent variable, learning). Classroom size was hypothesized to have a negative linear relationship with grades: the smaller the class, the better. But in practice, the relationship was quadratic.

When classrooms were too large, students were socially discouraged from asking too many questions. Conversely, when classrooms were too small, students had limited opportunity to hear questions (and answers) posed by other students that they might not have thought to ask themselves. So there's an optimal classroom size, and deviation in either direction is associated with a decrease in learning.

If classroom size is treated as a single continuous variable, the experiment's regression model would likely infer a near-zero coefficient for classroom size, because when the observed sizes straddle the optimum, the gains on one side cancel the losses on the other. But if classroom size were discretized into binary indicator variables, e.g. {[1,5), [5,15), [15,30), [30,60)} (all zeros would indicate that the classroom size is 60 or above; this bucket is omitted to avoid inducing perfect multicollinearity), and the model contained no other variables, the inferred coefficients would describe the average grade for each group, expressed relative to the omitted group when the model has an intercept. This removes the high bias of an assumed linear relationship in exchange for variance (the inferred effects could vary depending on the specific thresholds chosen for each group).
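To make the contrast concrete, here is a minimal sketch on simulated data. Everything in it is an assumption for illustration: the optimum at a class size of 50, the curvature, the noise level, and the helper `ols_slope` are all made up, and the buckets follow the ones above with 60 and over as the last group.

```python
import bisect
import random

random.seed(0)

# Hypothetical data: grades peak at a class size of 50 (an assumed optimum)
# and fall off quadratically on both sides.
sizes = [random.randint(1, 99) for _ in range(2000)]
grades = [85 - 0.01 * (s - 50) ** 2 + random.gauss(0, 3) for s in sizes]


def ols_slope(x, y):
    """Slope of a simple least-squares line y = a + b*x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


# Class size as one continuous predictor.
slope = ols_slope(sizes, grades)

# Class size as buckets: [1,5), [5,15), [15,30), [30,60), and 60 or more.
edges = [5, 15, 30, 60]
labels = ["[1,5)", "[5,15)", "[15,30)", "[30,60)", "60+"]
groups = [[] for _ in labels]
for s, g in zip(sizes, grades):
    groups[bisect.bisect_right(edges, s)].append(g)

# Each group mean equals the coefficient on that bucket's dummy in a
# regression on all five dummies with no intercept.
bin_means = [sum(g) / len(g) for g in groups]

print(f"linear slope: {slope:.4f}")
for label, mean in zip(labels, bin_means):
    print(f"{label:>8}: {mean:.1f}")
```

In this setup the fitted slope should come out close to zero, while the bucket means rise toward the middle buckets and fall again for the largest classes.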

Of course, we wouldn't limit ourselves to a single variable without a model intercept; the above simply illustrates that bias (an assumed linear relationship) can be traded for variance (the specific thresholds in the discretization strategy). If we don't have a strong prior on the relationship between the predictor and the response, I'd argue that discretization is the better option for avoiding false negatives (near-zero coefficients with large confidence intervals).

Philosophically, what is the community's opinion of discretizing all continuous variables by default?

Edit: the comments suggest using splines. My question is: how can this be done in the context of causal inference and/or experimentation?

For example, if a difference-in-differences (DiD) design is employed, how can a regression spline be used?

DiD would typically assign one coefficient to each feature/factor/dimension. It’s unclear to me how it would consider a variable that is the spline predicted value of another variable…
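One way to picture it, as a rough sketch rather than a definitive recipe: the spline does not enter as a single "predicted value" column. Instead the covariate is expanded into several basis columns, each with its own coefficient, and the DiD effect is still the single coefficient on the treated-by-post interaction. The simulated data, the true effect of 2.0, the knot locations, and the hand-rolled `linear_spline_basis` helper below are all assumptions for illustration; in practice a natural cubic spline basis from a statistics library would be the more usual choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000

# Hypothetical repeated cross-section: group, period, and a continuous covariate.
treated = rng.integers(0, 2, n)
post = rng.integers(0, 2, n)
x = rng.uniform(0, 100, n)  # e.g. class size

true_effect = 2.0  # assumed DiD effect used to simulate the outcome
y = (70 + 1.0 * treated + 0.5 * post + true_effect * treated * post
     - 0.01 * (x - 50) ** 2            # nonlinear covariate effect
     + rng.normal(0, 2, n))


def linear_spline_basis(x, knots):
    """Columns: x itself plus one hinge term max(x - k, 0) per knot."""
    return np.column_stack([x] + [np.clip(x - k, 0, None) for k in knots])


knots = [25, 50, 75]  # assumed knot locations
B = linear_spline_basis(x, knots)

# Design matrix: intercept, treated, post, treated*post, then the spline columns.
X = np.column_stack([np.ones(n), treated, post, treated * post, B])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

did_estimate = beta[3]   # coefficient on treated*post
spline_coefs = beta[4:]  # several coefficients for the one covariate

print(f"DiD estimate: {did_estimate:.2f}")
print(f"spline coefficients for x: {np.round(spline_coefs, 3)}")
```

The covariate ends up with four coefficients instead of one, but the quantity of interest is read off the interaction term exactly as in an ordinary DiD regression.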

jbuddy_13
  • Besides the question listed as a duplicate, also see this page, this page, and their links. Binning continuous variables is poor practice. A regression spline or other generalized additive model can let the data tell you the shape of the association between a continuous predictor and outcome, without the problems introduced by aggregating continuous data into arbitrary bins. – EdM Jan 06 '24 at 18:29
  • @EdM I’ve not heard of the use of splines in experimentation before. Could you link a resource on how splines might be used when the goal is to measure the causal effect of A on B? – jbuddy_13 Jan 06 '24 at 22:51
  • Frank Harrell's Regression Modeling Strategies is a superb, freely available resource on regression. Section 2.4 discusses regression splines. There are other ways to fit continuous predictors flexibly in regression models, sometimes grouped under the name "generalized additive models." See Chapter 7 of An Introduction to Statistical Learning. An outline of alternative approaches is on this page. – EdM Jan 07 '24 at 14:34
  • @EdM, voting to reopen as the resources you linked don’t seem to “play nicely” with experimental and causal inference methods such as DiD. – jbuddy_13 Jan 09 '24 at 08:37
  • In its current form, I think that the question is a bit too vague to reopen and a central part--"the community's opinion of discretizing all continuous variables by default"-- is answered by the links in comments. I would vote to reopen an edited question that describes a specific (if hypothetical) situation that you think binning would solve while splines wouldn't. Provide an example of the general issues that you raise in the Edit, in which you think that splines don't "play nicely" with causal inference methods. – EdM Jan 09 '24 at 13:14
