I'm running an OLS regression, and one of the explanatory variables measures sewage coverage, which is zero in 42% of the observations (these are true zeros). I'm worried that a single linear term will not capture the effect very well. Does it make sense to add a dummy to the model that equals 1 if sewage coverage is nonexistent and 0 if it exists?
-
I would change your modeling setup and run a 2-stage model. Check this answer here: https://stats.stackexchange.com/a/571172/32477 – Stefan Apr 15 '22 at 15:50
-
@Stefan the 0 values here seem to be in a predictor rather than the outcome. – EdM Apr 15 '22 at 15:52
-
@EdM Right, thanks for pointing this out! I misread it. – Stefan Apr 15 '22 at 15:54
1 Answer
This is a reasonable way to proceed if you know that these are true 0 values that make the corresponding continuous value meaningless. This answer from @whuber illustrates how to proceed in such a situation, where loan values are used as continuous predictors but are necessarily 0 if there is no loan at all. In this scenario, the results may be easier to interpret if you reverse the coding of your dummy variable, so that it equals 1 when sewage coverage exists and 0 when it does not.
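As a rough illustration of the dummy-plus-continuous setup, here is a minimal sketch on simulated data. The variable names (`has_sewer`, `coverage`), the sample size, and the coefficient values are all made up for the example, and it uses plain NumPy least squares rather than any particular regression package.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated data (all values hypothetical): roughly 42% of cities have no sewer system.
has_sewer = (rng.random(n) > 0.42).astype(float)   # 1 = sewer system exists, 0 = none
coverage = np.where(has_sewer == 1, rng.uniform(1, 100, n), 0.0)

# Assumed data-generating process for the illustration:
#   no sewers: mean outcome 10
#   sewers:    mean outcome 14 + 0.05 * coverage
y = 10 + 4 * has_sewer + 0.05 * coverage + rng.normal(0, 1, n)

# Design matrix: intercept, reversed dummy, continuous coverage.
X = np.column_stack([np.ones(n), has_sewer, coverage])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, dummy_coef, slope = beta

print(f"intercept={intercept:.2f}, dummy={dummy_coef:.2f}, slope={slope:.3f}")
```

With the dummy coded as 1 where a sewer system exists, `intercept` estimates the mean outcome for cities without sewers, and `slope` is identified only from cities that have them.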
If these are true 0 values on a continuous scale with actual values near 0, then it might make more sense to use a flexible regression spline or generalized additive model to capture the potential nonlinearity.
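To make the spline idea concrete, the sketch below fits a simple linear regression spline (hinge terms at a few knots) and compares it with a straight-line fit. The data, the curved relationship, and the knot positions are assumptions chosen for illustration; in practice a dedicated spline or GAM implementation with smoother basis functions would be the usual choice.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400

# Hypothetical data: coverage spread over 0..100 with a curved effect on the outcome.
coverage = rng.uniform(0, 100, n)
y = 5 + 2 * np.sqrt(coverage) + rng.normal(0, 1, n)

def linear_spline_basis(x, knots):
    """Columns: intercept, x, and one hinge term max(x - k, 0) per knot."""
    hinges = [np.maximum(x - k, 0.0) for k in knots]
    return np.column_stack([np.ones_like(x), x, *hinges])

def rss(X, y):
    """Residual sum of squares of the least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

knots = [10, 25, 50, 75]                      # illustrative knot positions
X_lin = np.column_stack([np.ones(n), coverage])
X_spl = linear_spline_basis(coverage, knots)

rss_linear = rss(X_lin, y)
rss_spline = rss(X_spl, y)
print(f"RSS linear: {rss_linear:.1f}, RSS spline: {rss_spline:.1f}")
```

Because the straight-line model is nested in the spline basis, the spline fit can never have a larger residual sum of squares; whether the improvement justifies the extra terms is a separate model-comparison question.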
If these aren't true 0 values but represent values below some detection limit, see this page for suggestions on how to handle left-censored predictor variables.
-
The cities that have sewage coverage in the dataset have values ranging from >0 to 100. Just to be certain, can you explain a bit more about what you mean by "make the corresponding continuous value meaningless"? Also, does the interpretation of the coefficient on the continuous variable remain the same? – llb1706 Apr 15 '22 at 17:07
-
@llb1706 see the linked answer on loan values: if there's no loan, then loan value is essentially meaningless. If "nonexistent" coverage means a city without a sewer system, then it's meaningless to discuss how much its sewage system covers. If you set your dummy to 0 for "nonexistent" sewage coverage, then the intercept represents a situation with "nonexistent" sewage coverage. The sum of the intercept and the dummy coefficient is the extrapolation to 0 coverage among cities with sewer systems. The slope for coverage is the extra outcome per unit increase in coverage, for cities with sewers. – EdM Apr 15 '22 at 17:54
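The interpretation in the comment above can be written out explicitly. The notation here is introduced only for illustration: $d_i = 1$ if city $i$ has a sewer system and $0$ otherwise, and $c_i$ is its coverage (necessarily $0$ when $d_i = 0$).

$$
y_i = \beta_0 + \beta_1 d_i + \beta_2 c_i + \varepsilon_i
$$

$$
E[y_i \mid d_i = 0] = \beta_0, \qquad E[y_i \mid d_i = 1,\ c_i = c] = (\beta_0 + \beta_1) + \beta_2 c
$$

So $\beta_0$ is the expected outcome with no sewer system, $\beta_0 + \beta_1$ is the extrapolation to zero coverage among cities with sewers, and $\beta_2$ is the change in outcome per unit of coverage among those cities.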
-
So it might be better to set the dummy to zero where sewage coverage is nonexistent, and to 1 if it exists. Thanks! – llb1706 Apr 15 '22 at 18:12