For example, suppose I want to determine the effect of rain on daily traffic volume. I would include a binary variable for whether or not it rained, and a continuous variable for the amount of rain. Would this be redundant?

  • There are a number of similar Q&As on this site: https://stats.stackexchange.com/questions/511208/what-are-the-pitfalls-of-including-a-continuous-variable-and-a-discretized-versi?rq=1 and https://stats.stackexchange.com/questions/622282/for-a-logistic-regression-how-to-include-continuous-independent-variable-that-a?rq=1 – mdewey Nov 22 '23 at 14:18
  • See https://stats.stackexchange.com/a/4833/919 for an example. Or https://stats.stackexchange.com/a/1795/919. Or https://stats.stackexchange.com/a/372258/919. Or https://stats.stackexchange.com/a/6565/919. Or many more found with this site search. – whuber Nov 22 '23 at 14:38

3 Answers

You could do this; however, it is not without its problems. If you do, then the coefficient for the indicator will tell you whether rainy days differ from dry days, and the coefficient for the continuous variable will tell you whether the amount of rain makes a difference, given that it has rained. On the other hand, the indicator will also be correlated with the amount of rain, so it could lead to an inflated standard error. We can demonstrate this with a simple simulation in R:

    set.seed(1)

    n_sim <- 1000
    N <- 100

    # Estimate and SE of the x1 coefficient without (1) and with (2) the indicator
    simvec_est_1 <- numeric(n_sim)
    simvec_SE_1  <- numeric(n_sim)
    simvec_est_2 <- numeric(n_sim)
    simvec_SE_2  <- numeric(n_sim)

    for (i in 1:n_sim) {
      x1 <- rnorm(N, 0, 1)
      x1 <- x1 * (x1 > 0)       # amount of rain; zero on non-rainy days
      x2 <- as.numeric(x1 > 0)  # indicator: 1 on rainy days, 0 on non-rainy days

      y <- 10 + x1 * 3 + rnorm(N, 0, 5)  # fixed effect of 3 for x1

      coef_1 <- summary(lm(y ~ x1))$coef
      simvec_est_1[i] <- coef_1[2, 1]
      simvec_SE_1[i]  <- coef_1[2, 2]

      coef_2 <- summary(lm(y ~ x1 + x2))$coef
      simvec_est_2[i] <- coef_2[2, 1]
      simvec_SE_2[i]  <- coef_2[2, 2]
    }

    # Average estimate and average SE across all simulations
    mean(simvec_est_1); mean(simvec_SE_1)
    mean(simvec_est_2); mean(simvec_SE_2)

which gives results like the following:

    [1] 3.011803
    [1] 0.8578615
    [1] 3.042935
    [1] 1.159548

As we can see, the estimates are unbiased (both are close to the true value of 3), but the standard error is inflated when the indicator is included.

Robert Long
  • 60,630
  • I beg strongly to differ: see the links in a comment to the question. The concern about SEs is valid, but the conclusion that this is "not a good idea" does not stand up. To support that opinion, you need to offer a superior alternative. What would it be? – whuber Nov 22 '23 at 14:39
  • @whuber I suppose I was thinking that the procedure of dichotomising a continuous variable is usually thought of as a bad idea, so I would only use the continuous variable. Am I missing something here? – Robert Long Nov 22 '23 at 14:49
  • Yes -- it's all explained in the links I provided. This is not mere dichotomizing: it is in effect creating a hierarchical model in which the zeros are treated as a special condition. – whuber Nov 22 '23 at 14:49
  • @whuber I have edited my answer. I hadn't noticed the links you posted until you commented here. Thanks! – Robert Long Nov 22 '23 at 15:31
  • +1. But ordinarily I wouldn't be worried about "inflation" of the SE, because I am left wondering "inflated with respect to what?" If the model with a zero indicator fits better overall, then the model without it simply is a poor one and any comparison to the SE of the coefficient of the explanatory variable is meaningless anyway. – whuber Nov 22 '23 at 15:38

As long as you include the variables together, I do not see the problem here. The coefficient for the binary variable will tell you whether rainy days differ from dry days, and the coefficient for the continuous variable will tell you whether, conditional on it having rained, the amount of rain makes a difference.

The important thing here is to focus on the joint effect of the two variables by fitting a model with both of them (plus other covariates) and then the same model without the two variables. Comparing the two fits gives you the overall effect of your composite variable. Since you know that the two variables are related, it seems pointless to compute the variance inflation factor, as you already know the answer.
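
A minimal sketch of that joint comparison in R, on simulated data. The variable names (`traffic`, `rained`, `rain_mm`, `weekday`) and every coefficient in the data-generating step are made up for illustration; they are not taken from the question.

```r
set.seed(42)
n <- 200

# Hypothetical data: about 40% rainy days; rainfall amount is zero on dry days
rained  <- rbinom(n, 1, 0.4)
rain_mm <- rained * rgamma(n, shape = 2, rate = 0.2)
weekday <- rbinom(n, 1, 5 / 7)  # stand-in for "other covariates"

# Assumed data-generating process (illustration only)
traffic <- 1000 + 150 * weekday - 80 * rained - 4 * rain_mm + rnorm(n, 0, 50)

# Same model without and with the two rain variables
fit_reduced <- lm(traffic ~ weekday)
fit_full    <- lm(traffic ~ weekday + rained + rain_mm)

# F test on 2 degrees of freedom: the joint effect of the rain variables
joint_test <- anova(fit_reduced, fit_full)
print(joint_test)
```

The single F test answers "does rain matter at all?" without having to interpret the two correlated coefficients separately.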

mdewey
  • 17,806

If you're asking in general whether you can include a continuous variable and its indicator in the same model, I don't think anyone here can tell you that you can or can't. It's not as if the statistical modelling police are going to arrest you, or your program is going to crash...

If you are asking specifically about the variables in your example, I also can't say anything for sure, but I'm afraid there's a high chance of seeing odd things in your model. For example, you might find that the indicator variable has a coefficient of zero and wrongly infer that there's no link between traffic volume and whether it rained. On the other hand, including both variables may lower your error.

Some guiding questions:

  1. What is the purpose of the model? Is it a descriptive model? Is it a predictive one?
  2. What other variables do you have? Is there an intercept? If there isn't an intercept, your model would predict zero traffic volume for non-rainy days. Are you comfortable with that?
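
To illustrate the intercept point, here is a small sketch in R on simulated data. The names `traffic`, `rained` and `rain_mm` and all numbers are assumptions chosen for the example, and the model contains only the two rain variables.

```r
set.seed(7)
n <- 100

# Hypothetical data: rainfall amount is zero whenever it did not rain
rained  <- rbinom(n, 1, 0.5)
rain_mm <- rained * rexp(n, rate = 0.1)
traffic <- 1000 - 100 * rained - 3 * rain_mm + rnorm(n, 0, 30)

fit_no_int <- lm(traffic ~ 0 + rained + rain_mm)  # no intercept
fit_int    <- lm(traffic ~ rained + rain_mm)      # with intercept

# Prediction for a dry day: both rain variables are zero
dry_day <- data.frame(rained = 0, rain_mm = 0)
pred_no_int <- predict(fit_no_int, newdata = dry_day)  # forced to be 0
pred_int    <- predict(fit_int, newdata = dry_day)     # the fitted intercept

c(no_intercept = unname(pred_no_int), with_intercept = unname(pred_int))
```

With other covariates in the model the dry-day prediction would no longer be exactly zero, but it would still be driven entirely by those covariates.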

In addition, I don't think that your variables must have a high Pearson correlation coefficient.

Let us denote the indicator variable by $X$ and the continuous variable by $Y$, and assume that both are nonnegative and not constant. Then $$ Corr(X,Y) = \frac{E(XY)-E(X)E(Y)}{\sqrt{Var(X)Var(Y)}}. $$

Denote by $p$ the proportion of rainy days in your data. I assume $0 < p < 1$.

Since $Y = 0$ whenever $X = 0$, we have $XY = Y$ and therefore $E(XY)=E(Y)$. Also, $X$ is a binary random variable, hence $E(X) = p$ and $Var(X)=p(1-p)$.

So your correlation coefficient will be: $$ Corr(X,Y) = \frac{E(Y)-pE(Y)}{\sqrt{p(1-p)Var(Y)}} = \sqrt{\frac{1-p}{p}}\frac{E(Y)}{\sqrt{Var(Y)}}. $$

$Corr(X,Y)$ can take any value in $(0,1]$, depending on $p$ and the distribution of $Y$.
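
As a quick numerical check of this formula, here is a sketch in R. The choice $p = 0.3$ and the gamma distribution for rainfall on rainy days are arbitrary assumptions made for the example.

```r
set.seed(1)
n <- 1e5
p <- 0.3  # assumed proportion of rainy days

x <- rbinom(n, 1, p)                       # indicator: 1 if it rained
y <- x * rgamma(n, shape = 2, rate = 0.5)  # rainfall amount, zero on dry days

p_hat <- mean(x)

# Sample correlation vs. the closed form sqrt((1 - p) / p) * E(Y) / sd(Y)
corr_sample  <- cor(x, y)
corr_formula <- sqrt((1 - p_hat) / p_hat) * mean(y) / sd(y)

c(sample = corr_sample, formula = corr_formula)
```

The two values agree up to a factor of $\sqrt{n/(n-1)}$, which comes from the sample variance using $n-1$ in its denominator. Changing `p` or the rainfall distribution moves the correlation around within $(0,1]$.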

  • I like the idea of the analysis at the end, but it's worked incorrectly. $X$ must be the indicator that $Y=0,$ for otherwise it's useless. But then $XY=0$ and $E(XY)=0.$ – whuber Nov 22 '23 at 15:35
  • I'm not sure what the OP wanted to ask. I assumed the indicator is for $Y=1$. Anyways people here in the comments were talking about correlation between the indicator and the continuous variable. That's not necessarily the case. – Alex Teush Nov 22 '23 at 17:59
  • That assumption is erroneous and makes no sense anyway. A "binary variable whether or not it rained" refers to $X,$ not $Y.$ If you would review some of the threads I linked to, you can read why. – whuber Nov 22 '23 at 18:23
  • I fully understand why it's more efficient to set the indicator as you proposed. However: 1) Neither of us knows what was going on in @Marc Ignacio's head while he was writing the question. He's more than welcome to clarify. 2) Some people here, like Robert Long, considered the case $\mathcal{I}(Y=1)$, saying that it will correlate with $Y$. I showed otherwise. 3) Even though less efficient, setting the indicator $\mathcal{I}(Y=1)$ might reduce the error, so I'm not sure it's useless. Skimming the threads you referred to, I didn't see anything that contradicts that. – Alex Teush Nov 23 '23 at 05:53
  • It is useless as demonstrated in my linked posts. Briefly, any model that includes an intercept and $\alpha X+\beta Y$ in its expression is equal to $\alpha X$ when $X=0$ (because then $Y=0$) and otherwise is equal to $\beta X + 1$ and the additive $1$ is absorbed in the intercept. Consequently, $Y$ is completely redundant. It's irrelevant whether something that doesn't work is more efficient. – whuber Nov 23 '23 at 18:24
  • I asked the OP if they're talking about a model with an intercept or without. What if the model is without intercept? – Alex Teush Nov 23 '23 at 20:15
  • Then it depends on whether the intercept can be represented by some linear combination of the explanatory variables. But now that you have seen how to analyze the situation, I'm sure you can work that out. – whuber Nov 24 '23 at 15:05
  • It really makes no difference whether the model is with an intercept or the intercept is a linear combination of other variables. Bottom line is that, in some cases, indicator $\mathcal{I}(X=1)$ can reduce the error. Therefore it isn't necessarily useless. – Alex Teush Nov 27 '23 at 05:49