It crossed my mind that when you're designing an experiment and are interested not in NHST but in a full regression model, where coefficients for treatment exposure and relevant covariates are desired, perhaps it's best to discretize all continuous variables by default.
Whether a continuous covariate has a linear relationship with the response variable is likely not known a priori by scientists. For example, there was once a lot of interest in classroom sizes and their effects on grades (as an instrument for the latent variable, learning). Classroom size was hypothesized to have a negative linear relationship with grades: the smaller the class, the better. But in practice, the relationship was quadratic.
When classrooms were too large, students were socially discouraged from asking too many questions. Conversely, when classrooms were too small, students had limited opportunity to hear questions (and answers) posed by other students that they might not have thought to ask themselves. So there's an optimal classroom size, and deviation in either direction is associated with a decrease in learning.
If classroom size is treated as a single continuous variable, the experiment's regression model would likely infer a near-zero coefficient for classroom size. But if classroom size were discretized into binary indicator variables, one per bucket, e.g. {[1,5), [5,15), [15,30), [30,60), [60,100)} (the last bucket is omitted to avoid inducing multicollinearity, so all zeros indicates a classroom size of 60 or more), and the model contained no other variables, the inferred coefficients would articulate the average grade for each group. This removes the high bias of an assumed linear relationship in exchange for variance (the inferred effects could vary depending on the specific thresholds chosen for each group).
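To make the contrast above concrete, here is a minimal simulation sketch using only the Python standard library. Everything numeric in it is an assumption made for illustration, not data from any study: the grade formula, the optimum at a class size of 50, the noise level, and the sample size are all hypothetical. The buckets are the ones listed above, each treated as half-open.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical data-generating process: grades peak at a class size of 50
# and fall off quadratically in both directions (illustrative numbers only).
sizes = [rng.randint(1, 99) for _ in range(5000)]
grades = [85 - 0.01 * (s - 50) ** 2 + rng.gauss(0, 5) for s in sizes]

# (a) Single continuous term: OLS slope of grade on class size.
x_bar, y_bar = mean(sizes), mean(grades)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(sizes, grades)) / sum(
    (x - x_bar) ** 2 for x in sizes
)

# (b) Discretized: with one indicator per bucket and no intercept, the OLS
# coefficients are simply the within-bucket mean grades.
buckets = [(1, 5), (5, 15), (15, 30), (30, 60), (60, 100)]  # half-open [lo, hi)
bucket_means = {
    (lo, hi): mean(g for s, g in zip(sizes, grades) if lo <= s < hi)
    for lo, hi in buckets
}

print(f"linear slope: {slope:+.4f}")
for (lo, hi), m in bucket_means.items():
    print(f"[{lo:>2}, {hi:>3}): mean grade {m:.1f}")
```

Under these assumptions the linear slope comes out close to zero because the simulated class sizes are symmetric around the optimum, while the bucket means rise and then fall. Moving the bucket thresholds changes the means, which is the variance side of the trade.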
Of course, we wouldn't limit ourselves to a single variable without a model intercept; the above simply illustrates that bias (an assumed linear relationship) can be traded for variance (the specific thresholds in the discretization strategy). If we don't have a strong prior on the relationship between a predictor and the response, I'd argue that discretization is the better option for avoiding false negatives (near-zero coefficients with wide confidence intervals).
Philosophically, what is the community's opinion of discretizing all continuous variables by default?
Edit: the comments suggest using splines. My question is: how can this be done in the context of causal inference and/or experimentation?
For example, if a difference-in-differences (DiD) design is employed, how can a regression spline be used?
DiD would typically assign one coefficient to each feature/factor/dimension. It’s unclear to me how it would consider a variable that is the spline predicted value of another variable…
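One possible reading, offered as a sketch rather than a definitive answer: a regression spline does not enter the model as a single "spline-predicted" variable. The covariate is expanded into several basis columns, each of which gets its own coefficient, and the DiD estimate is still the coefficient on the treated-by-post interaction. The example below assumes a simplified 2x2 design with simulated cross-sectional data, a linear (hinge) spline with hypothetical knots, and a hypothetical true effect of 2.0. It uses only the standard library and solves the normal equations by hand; in practice a regression library would do this step.

```python
import random

rng = random.Random(1)
KNOTS = (0.3, 0.6)   # hypothetical knot locations on the rescaled covariate
TRUE_EFFECT = 2.0    # hypothetical treatment effect built into the simulation

def spline_basis(x):
    """Linear regression spline: x plus one hinge term per knot."""
    return [x] + [max(0.0, x - k) for k in KNOTS]

def solve(a, b):
    """Solve a @ beta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    beta = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * beta[j] for j in range(i + 1, n))
        beta[i] = (m[i][n] - tail) / m[i][i]
    return beta

rows, y = [], []
for _ in range(4000):
    treated, post = rng.randint(0, 1), rng.randint(0, 1)
    x = rng.random()  # covariate, e.g. class size rescaled to [0, 1)
    # Outcome is nonlinear in x; the effect applies only to treated units post-period.
    outcome = (70 + 3 * treated + 1 * post + TRUE_EFFECT * treated * post
               + 40 * x * (1 - x) + rng.gauss(0, 1))
    # Design row: intercept, group, period, DiD interaction, then spline columns.
    rows.append([1.0, treated, post, treated * post] + spline_basis(x))
    y.append(outcome)

# Ordinary least squares via the normal equations (X'X) beta = X'y.
p = len(rows[0])
xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
beta = solve(xtx, xty)

did_estimate = beta[3]   # coefficient on treated * post
spline_coefs = beta[4:]  # one coefficient per spline basis column
print(f"DiD estimate: {did_estimate:.2f} (simulated truth {TRUE_EFFECT})")
print("spline coefficients:", [round(c, 2) for c in spline_coefs])
```

Seen this way, discretization and splines are the same move: both replace one column with several basis columns, and the DiD interaction term is untouched by either.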