A predictor that "becomes" categorical when larger than a cutoff

Question

I have a dataset where a predictor can be in one of 2 states. In state 1, the value is always the same; in state 2, the value is continuous and varies. An example is the number of cigarettes smoked per day: for a non-smoker the value is always 0, while for a smoker it can be 0 or higher. In my model, I have 2 predictor variables - NoSmoke and Cig. Until recently, I thought this was the correct approach (based on another SO post that I can no longer find). However, I was recently told by a statistician that this approach is problematic - partly because the 2 predictors are not independent, and partly because they're correlated.

When I run the model and plot the predictions, things seem to be working out. Is there an issue with this modeling approach? And if so - what's the correct way to analyze this, while preserving the continuous predictor values?

toy example with simulated data:

library(plyr)
library(dplyr)
library(tidyr)
library(ggplot2)
df1 <- data.frame(NoSmoke = "NoSmoke", Cig = rep(0, 100), Response = rnorm(100, 20, 5))
df2 <- data.frame(NoSmoke = "Smoke", Cig = rpois(100, 2)) %>%
        mutate(Response = 5 + 2 * Cig + rnorm(100, 20, 5))
df <- rbind(df1, df2)
mod <- lm(Response ~ NoSmoke + Cig, data = df)
new <- data.frame(NoSmoke = c("NoSmoke", rep("Smoke", 7)), Cig = c(0, 0:6))
preds <- as.data.frame(predict(mod, newdata = new, interval = "confidence"))
new$Pred <- preds$fit
new$Lower <- preds$lwr
new$Upper <- preds$upr
ggplot(df) +
    geom_point(aes(x = Cig, y = Response, colour = NoSmoke), position = position_dodge(width = 0.2)) +
    geom_point(data = new, aes(x = Cig, y = Pred, fill = NoSmoke), size = 2, shape = 21, colour = "black", 
        position = position_dodge(width = 0.2)) +
    geom_errorbar(data = new, aes(x = Cig, ymin = Lower, ymax = Upper, colour = NoSmoke), 
        position = position_dodge(width = 0.2))

EDIT

I left this out of the question for simplicity, but following some of the answers/comments: my actual predictor is truly continuous, unlike the number of cigarettes in the example (apologies, I was trying for simplicity). Regarding the correlation between the NoSmoke and Cig variables: in my analysis I standardize Cig, so that the values for the NoSmoke=="Smoke" cases are standardized as normal and the standardized value for the NoSmoke=="NoSmoke" cases is set to zero. This allows the model to be Response = Intercept + beta_cig*CigS for the smokers and Response = Intercept + beta_nosmoke for the non-smokers. The correlation between NoSmoke and CigS is low - ~0.4 in the example, and <0.1 in my real data.

Here's the toy example with the standardization:

df <- df %>%
        group_by(NoSmoke) %>%
        mutate(CigS = (Cig - mean(Cig)) / sd(Cig)) %>%
        ungroup() %>%
        # sd(Cig) is 0 for non-smokers, so their CigS is NaN until it is reset to 0 here
        mutate(CigS = ifelse(NoSmoke == "NoSmoke", 0, CigS),
               NoSmoke = factor(NoSmoke, levels = c("Smoke", "NoSmoke")))
cor(as.numeric(df$NoSmoke), df$CigS)
mod <- lm(Response ~ NoSmoke + CigS, data = df)
new <- data.frame(NoSmoke = c("NoSmoke", rep("Smoke", 7)), Cig = c(0, 0:6)) %>%
        left_join(unique(select(df, NoSmoke, Cig, CigS)))
preds <- as.data.frame(predict(mod, newdata = new, interval = "confidence"))
new$Pred <- preds$fit
new$Lower <- preds$lwr
new$Upper <- preds$upr
ggplot(df) +
    geom_point(aes(x = Cig, y = Response, colour = NoSmoke), position = position_dodge(width = 0.2)) +
    geom_point(data = new, aes(x = Cig, y = Pred, fill = NoSmoke), size = 2, shape = 21, colour = "black", 
        position = position_dodge(width = 0.2)) +
    geom_errorbar(data = new, aes(x = Cig, ymin = Lower, ymax = Upper, colour = NoSmoke), 
        position = position_dodge(width = 0.2))

Is this the post you had in mind? The combination of an indicator variable and a continuous variable works OK if you interpret the coefficients carefully, as explained in that post. — EdM, Feb 10 '23 at 20:03
Maybe tangential: https://stats.stackexchange.com/questions/372257/how-do-you-deal-with-nested-variables-in-a-regression-model — COOLSerdash, Feb 10 '23 at 21:54
Note that no. cigarettes smoked per day isn't a continuous predictor. Anyway, the real issues raised by your model are whether it is reasonable to suppose that (1) for smokers, the response increases more or less linearly with no. cigarettes smoked per day, & (2) the errors have the same distribution for smokers & non-smokers. The latter point may merit an example: suppose your non-smokers include vapers & your response is nicotine levels in your subjects' bloodstream. — Scortchi - Reinstate Monica, Feb 10 '23 at 22:43
@EdM - it was actually this post, I just found it: https://stats.stackexchange.com/questions/56306/time-spent-in-an-activity-as-an-independent-variable. But I think both of these are suggesting the same solution, which is also the same solution as I'm using? — user2602640, Feb 11 '23 at 20:37
The edited question makes things clearer. Your interpretation of the coefficients seems OK, given that your use of the NoSmoke indicator variable is opposite in direction to that in the post to which I linked, where the indicator was 0 when the continuous predictor was necessarily 0. This approach leads to a potential discontinuity at CigS=0 between the NoSmoke groups. That's OK if it makes sense based on the subject matter. Interpretation might be easier, if all Cig values are non-negative, to use them without centering/scaling. — EdM, Feb 11 '23 at 21:06
The correlation between the two smoking-associated predictors isn't a problem. That will lead to a covariance in their coefficient estimates, but won't affect the model as a whole. For example, a test of the overall significance of smoking with respect to outcome could be a Wald test on both coefficients together, which would take that covariance appropriately into account. Also, note that your standardization of the Cig values included the 0 values for the NoSmoke=="NoSmoke" group; is that what you intended? I don't think that standardization helps you here. — EdM, Feb 11 '23 at 21:14
@EdM gah, sorry, was too quick in the edit. When standardizing, I do it after grouping by NoSmoke. I'm surprised to hear the correlation between the two variables doesn't matter, wasn't expecting that. So are you suggesting testing the full model against a model that omits both NoSmoke and Cig, to get a P-value for the significance of the combined covariates? — user2602640, Feb 11 '23 at 21:31
That's the best way to check the combined significance, if you don't mind fitting 2 models. Or you could do a Wald "chunk" test on both coefficients from the full model as outlined here. Correlations among predictors (aka "multicollinearity") gets a lot of bad press but, unless it's extreme, it just increases the variances of individual coefficient estimates. Model predictions are fine. A chunk test on predictor combinations can correct for their correlations, as illustrated here. — EdM, Feb 11 '23 at 21:49
@EdM Very cool. Want to write it all up in an answer? I'll gladly accept it. — user2602640, Feb 12 '23 at 01:08
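The two ways of testing the combined significance discussed in the comments above can be sketched as follows. This is a minimal illustration under assumptions, not the commenters' own code: the data are re-simulated with an arbitrary seed (so the numbers will not match any output shown elsewhere in this thread), and the Wald statistic is computed by hand from `coef()` and `vcov()` rather than with a dedicated package function.

```r
# Sketch only: simulated data mimicking the question's toy example.
set.seed(20230212)  # arbitrary seed, assumed here for reproducibility
n <- 100
df <- data.frame(
  NoSmoke = rep(c("NoSmoke", "Smoke"), each = n),
  Cig = c(rep(0, n), rpois(n, 2))
)
# Smokers get the extra 5 + 2 * Cig, as in the question's simulation
df$Response <- ifelse(df$NoSmoke == "Smoke", 5 + 2 * df$Cig, 0) +
  rnorm(2 * n, 20, 5)

full <- lm(Response ~ NoSmoke + Cig, data = df)
null <- lm(Response ~ 1, data = df)

# Option 1: F test comparing the nested models (2 numerator df)
joint <- anova(null, full)
print(joint)

# Option 2: Wald "chunk" test on both coefficients, from the full model only.
# Using the full covariance block accounts for the correlation between
# the two coefficient estimates.
terms_tested <- c("NoSmokeSmoke", "Cig")
b <- coef(full)[terms_tested]
V <- vcov(full)[terms_tested, terms_tested]
wald_F <- drop(t(b) %*% solve(V) %*% b) / length(b)
wald_p <- pf(wald_F, length(b), df.residual(full), lower.tail = FALSE)
c(F = wald_F, p = wald_p)
```

For an ordinary least-squares fit the two options give the same F statistic, so the choice is mostly one of convenience.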

score 8 · Answer 1 · answered Feb 10 '23 at 21:07

Having two features like those described by you is a perfectly normal and valid solution. It also arises in many different scenarios. For example, if you have two features: age and gender, and consider interaction age * gender, the values for it would be 0 for people whose gender was coded as 0 and non-negative otherwise (see also the link mentioned in the comment by EdM).

I cannot comment on what the statistician said because you didn't give us a full quote and context. Maybe they meant some specific scenario, but in general, such features and their interactions are commonly used.

score 8 · Accepted Answer · answered Feb 12 '23 at 19:17

@Tim, as usual, summarizes this well (+1): there is no problem with your "perfectly normal and valid solution." The following illustrates in a bit more detail.

On further review, I found your initial model much easier to interpret, with nonsmokers the reference group for the indicator variable and raw, non-negative and non-standardized values for Cig. That's the way that @whuber suggested in response to a very similar question. I used set.seed(20230212) before running your code. Then:

coef(mod)
#  (Intercept) NoSmokeSmoke          Cig 
#    19.920394     6.956709     1.511727

is easy to interpret: (Intercept) is the outcome estimate for non-smokers, NoSmokeSmoke is the extra outcome estimate for smokers if they had Cig=0, and Cig is the extra outcome beyond that per Cig. No need to deal with un-de-meaning, interpreting coefficients in a standard-deviation-of-the-predictor scale, or similar complications.

The correlation between the two predictors isn't a problem here. It seems large:

cor(df$NoSmoke=="Smoke",df$Cig)
# [1] 0.6980322

as another answer notes. Yes, it inflates the variances of the individual coefficient estimates, but not by much:

car::vif(mod)
#  NoSmoke      Cig 
# 1.950264 1.950264

A variance-inflation factor of that size isn't typically considered a problem. When you use the model for predictions there is a counterbalancing negative correlation between the coefficient estimates:

print(cov2cor(vcov(mod)),digits=3)
#              (Intercept) NoSmokeSmoke       Cig
# (Intercept)     1.00e+00       -0.506  9.41e-16
# NoSmokeSmoke   -5.06e-01        1.000 -6.98e-01
# Cig             9.41e-16       -0.698  1.00e+00

that leads to perfectly reasonable (and precise) predictions when the coefficient covariances are properly taken into account. You'll note that the regression restricted to the smokers gives the same result as the combined model:

coef(lm(Response~Cig,data=df,subset=NoSmoke=="Smoke"))
# (Intercept)         Cig 
#   26.877102    1.511727

with an (Intercept) that's the sum of the original model's (Intercept) plus its NoSmokeSmoke coefficient:

sum(coef(mod)[1:2])
# [1] 26.8771

So your solution for this type of data does not have the problems that one might have feared.
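As a rough illustration of what "taking the coefficient covariances into account" means for a prediction, the sketch below computes the standard error of one fitted value by hand. It assumes re-simulated data with an arbitrary seed (numbers will differ from those above) and a hypothetical prediction point, a smoker with Cig = 3.

```r
# Sketch only: simulated data mimicking the question's toy example.
set.seed(20230212)  # arbitrary seed, assumed here for reproducibility
n <- 100
df <- data.frame(
  NoSmoke = rep(c("NoSmoke", "Smoke"), each = n),
  Cig = c(rep(0, n), rpois(n, 2))
)
df$Response <- ifelse(df$NoSmoke == "Smoke", 5 + 2 * df$Cig, 0) +
  rnorm(2 * n, 20, 5)
mod <- lm(Response ~ NoSmoke + Cig, data = df)

# Design vector for a smoker with Cig = 3: (Intercept, NoSmokeSmoke, Cig)
x <- c(1, 1, 3)
V <- vcov(mod)

# Standard error using the full covariance matrix: sqrt(x' V x)
se_full <- sqrt(drop(t(x) %*% V %*% x))

# Naive version that ignores the covariances between coefficient estimates
se_naive <- sqrt(sum(x^2 * diag(V)))

# Reference value from predict()
se_pred <- unname(predict(mod,
                          newdata = data.frame(NoSmoke = "Smoke", Cig = 3),
                          se.fit = TRUE)$se.fit)

c(full = se_full, naive = se_naive, predict = se_pred)
```

The negative covariances shrink the prediction's standard error relative to the naive calculation, which is why the inflated variances of the individual coefficients do not translate into imprecise predictions.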

(+1) Note the regression model restricted to smokers will indeed give the same point estimates of the coefficients, but the variance estimates, & hence p-values & confidence intervals, will be a little different. — Scortchi - Reinstate Monica, Feb 12 '23 at 19:36
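The point in the comment above can be checked with a short sketch. It assumes re-simulated data with an arbitrary seed, so the numbers will not match the output in the answer; the object names `mod` and `sub` are illustrative.

```r
# Sketch only: simulated data mimicking the question's toy example.
set.seed(20230212)  # arbitrary seed, assumed here for reproducibility
n <- 100
df <- data.frame(
  NoSmoke = rep(c("NoSmoke", "Smoke"), each = n),
  Cig = c(rep(0, n), rpois(n, 2))
)
df$Response <- ifelse(df$NoSmoke == "Smoke", 5 + 2 * df$Cig, 0) +
  rnorm(2 * n, 20, 5)

mod <- lm(Response ~ NoSmoke + Cig, data = df)                    # everyone
sub <- lm(Response ~ Cig, data = df, subset = NoSmoke == "Smoke")  # smokers only

# Point estimates of the slope agree
slope_mod <- unname(coef(mod)["Cig"])
slope_sub <- unname(coef(sub)["Cig"])

# Standard errors differ: the combined model pools the residual variance
# over smokers and non-smokers, the restricted model uses smokers only
se_mod <- summary(mod)$coefficients["Cig", "Std. Error"]
se_sub <- summary(sub)$coefficients["Cig", "Std. Error"]

rbind(combined = c(slope = slope_mod, se = se_mod, sigma = summary(mod)$sigma),
      smokers  = c(slope = slope_sub, se = se_sub, sigma = summary(sub)$sigma))
```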

Pere · Answer 3 · 2023-02-12T21:04:40.233

Other answers are right, but I think we should take into account that by including or omitting the binary variable we are fitting different models with different underlying assumptions.

Dropping the binary variable and only using the continuous variable means assuming that the effects of smoking are continuous at 0. That is, the effect of smoking very little (approaching zero) approaches the effect of not smoking.

You can see that in your example this assumption is false, because the expected response for a non smoker is 20 but the expected response for a smoker of 0 cigarettes is 25. Therefore for this example the model can match better the underlying problem when both predictors are included.

From actual data we might not know how the response behaves near zero. For some problems the knowledge on the problem may give some clue: for example, some contaminants are known not to have any safe dose, so we can suppose that very little exposure is going to produce a different response than no exposure at all and using both predictors may be useful. Other phenomena can behave differently.

In case of no prior knowledge, you can test if the categorical predictor is significant, just as with any other predictor. Here we are working in the opposite way, and finding reasons to discard it as not significant may suggest that the response is continuous at zero smoking and that may be an interesting result in itself.

Additionally, if your actual problem involves more variables, by discarding the categorical predictor you are assuming that the effects of those other variables are the same for smokers and non-smokers. By including the categorical predictor and its interactions with the other variables, you are in effect fitting two different models, one for smokers and one for non-smokers. If you have a large enough dataset to fit such a model without overfitting, it can give interesting results.
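A minimal sketch of the test for continuity at zero described above, assuming re-simulated data with an arbitrary seed. The model without the indicator forces the smokers' line through the non-smokers' mean at Cig = 0; comparing it with the model that includes the indicator tests whether that restriction is tenable.

```r
# Sketch only: simulated data mimicking the question's toy example.
set.seed(20230212)  # arbitrary seed, assumed here for reproducibility
n <- 100
df <- data.frame(
  NoSmoke = rep(c("NoSmoke", "Smoke"), each = n),
  Cig = c(rep(0, n), rpois(n, 2))
)
df$Response <- ifelse(df$NoSmoke == "Smoke", 5 + 2 * df$Cig, 0) +
  rnorm(2 * n, 20, 5)

continuous_at_zero <- lm(Response ~ Cig, data = df)            # no jump at Cig = 0
with_jump          <- lm(Response ~ NoSmoke + Cig, data = df)  # allows a jump

# 1-df F test of the jump; equivalent to the t test on NoSmokeSmoke
jump_test <- anova(continuous_at_zero, with_jump)
print(jump_test)

t_jump <- summary(with_jump)$coefficients["NoSmokeSmoke", "t value"]
c(F = jump_test$F[2], t_squared = t_jump^2)
```

Whether a non-significant result should lead to dropping the indicator is a subject-matter judgment; the test only quantifies the evidence for a discontinuity in these data.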

(+1) The code snippet to generate the response is a bit hard to follow but it boils down to Response = 5 + 2 * Cig + rnorm(2 * n, 20, 5) for both smokers and non-smokers. So while it's reasonable to assume that E(Y | non-smoker) might be different from E(Y | smoker of 0 cigarettes), in this example these two expectations are equal to 25. — dipetkov, Feb 12 '23 at 20:45
@dipetkov - Yes, it's hard to follow and my numbers are a bit off, but my understanding is that for non-smokers the response is rnorm(100, 20, 5) while for smokers it is 5 + 2 * Cig + rnorm(100, 20, 5). Therefore E(Y|non-smoker)=20 but E(Y|smoker of 0 cigarettes)=25, because the +5 term applies only to smokers. The expectations are 20 and 25 instead of 0 and 5 as I said in the answer, but the response still isn't continuous at zero. I missed the mean 20 in rnorm but it affects both subsets. — Pere, Feb 12 '23 at 20:54

David B · Answer 4 · 2023-02-10T21:30:01.467

3

I would agree with your statistician. This is a common question, with no perfect answer. Ultimately, it comes down to what your question is. The two most common approaches I see in addiction research are (1) define 'control' as drug-exposed non-users and then define a level of drug use above which a user is defined as a 'case' (non-exposed non-users and exposed low-users are removed), then use these two groups for your analysis; or (2) exclude non-exposed non-users and treat level of use as a continuous variable (drug-exposed non-users are simply '0' on this scale).

If you check the correlation between your variables:

df$NoSmoke2 = (df$NoSmoke == "NoSmoke")*1
cor(df[,-1])

You'll see that NoSmoke and Cig are very highly correlated (with set.seed(1234) r = -0.73). This makes it quite hard to interpret the regression coefficients, or the significance of either variable. For example, let's say you want to know the effect of smoking on the response variable. What's the effect? Compare your effect of Non-smoking and your effect of cigarettes when they are entered together or separately.

The issue of sample restrictions has to do with trying to move beyond pure correlations. If you want to say something about the effect of smoking, then you should only be examining people who had an opportunity to smoke. Otherwise, reverse effects (the Response making people more likely to smoke) are a very real possibility.

edited Feb 10 '23 at 21:30

answered Feb 10 '23 at 20:07

David B

Could you clarify what the statistician's objection would be? Or point me to the literature? It's not apparent why these sample exclusion restrictions or redefinitions are necessary. – dimitriy Feb 10 '23 at 20:43
I edited my answer to clarify. – David B Feb 10 '23 at 21:48
I guess this doesn't bother me, since I would express the association as a function of correlated coefficients (like $\Delta = \beta_{smoker} + \gamma \cdot \mathtt{cigs}$, possibly with cigs demeaned), so what matters is the joint significance of the linear function of coefficients. – dimitriy Feb 10 '23 at 21:56
The coefficient of NoSmoke estimates the difference in mean response between smokers who smoke no cigarettes over the observation/reporting period & non-smokers; the coefficient of Cig estimates, for smokers, the change in mean response for each cigarette smoked; while the intercept estimates the mean response for non-smokers: I see no difficulty in interpretation. – Scortchi - Reinstate Monica Feb 10 '23 at 23:37
@DavidB I added an edit to my post in regards to the correlation. Does that provide more info? – user2602640 Feb 11 '23 at 20:51

score 0 · Answer 5 · answered Feb 10 '23 at 20:29

I would have one variable such as $\tilde x=\max(x,0)$. This type of variable is used in linear splines in some packages such as Stata, where it can be created with the mkspline command.

Normally, these are used to model varying slopes, e.g. you may have a model where the response to negative predictor is different from response to the positive one. In this case you create two variables: $x_+=\max(0,x),x_-=\min(0,x)$.
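The two spline variables above can be built in R with `pmax()` and `pmin()`. The sketch below is illustrative only: the predictor, the true slopes (0.5 below zero, 2 above zero) and the noise level are all made-up values for a simulated example, not anything from the question's data.

```r
# Sketch only: a simulated predictor that takes both negative and positive values.
set.seed(1)  # arbitrary seed
x <- runif(200, -3, 3)

x_pos <- pmax(0, x)  # x_+ = max(0, x), elementwise
x_neg <- pmin(0, x)  # x_- = min(0, x), elementwise

# Made-up truth: slope 0.5 below zero and slope 2 above zero
y <- 1 + 0.5 * x_neg + 2 * x_pos + rnorm(200, sd = 0.5)

# Linear spline with a knot at 0: one slope per side, continuous at the knot
fit <- lm(y ~ x_neg + x_pos)
coef(fit)
```

Because x_pos + x_neg reproduces x, this fit is continuous at the knot; unlike the indicator-plus-continuous model in the question, it does not allow a jump at zero.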

answered Feb 10 '23 at 20:29

Aksakal

If $x$ is no. cigarettes smoked per day, then $\max(x,0) = x$, as $x\geq0$. – Scortchi - Reinstate Monica Feb 10 '23 at 23:02
