
I am trying to create a classification model with independent variables IV1, IV2 and IV3 and dependent variable DV (DV ~ IV1 + IV2 + IV3).

Now the problem that I am facing is that IV2 exists only when IV1 takes a certain value. For example, IV1 may be whether a person owns a house. IV2 would then be the size of the house in square metres if IV1 is true; if IV1 is false, IV2 is not applicable.

The current approach that I am using to tackle this problem is to set IV2 to 0 whenever IV1 is false. I find this approach unsatisfactory, as it essentially turns IV2 into a mixture of a continuous and a discrete variable and introduces too much bias regardless of which statistical model I use.
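
For concreteness, here is a minimal sketch of what I am doing now, on made-up data, fitted as a logistic regression in Python. The column names (owns_house for IV1, house_size for IV2, iv3 for IV3, married for DV) and the simulated outcome are just hypothetical stand-ins for my real data:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500

owns_house = rng.integers(0, 2, size=n)                   # IV1: 0/1
house_size = np.where(owns_house == 1,                     # IV2: defined only for owners
                      rng.uniform(30, 200, size=n), np.nan)
iv3 = rng.normal(size=n)                                   # IV3: some other covariate

# Made-up outcome: owners' marriage odds rise with house size,
# while non-owners sit at a moderate baseline (as in the example below).
logit_p = np.where(owns_house == 1,
                   -3.0 + 0.03 * np.nan_to_num(house_size),
                   0.0) + 0.5 * iv3
married = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit_p)))  # DV: 0/1

df = pd.DataFrame({"owns_house": owns_house, "house_size": house_size,
                   "iv3": iv3, "married": married})

# The current approach: force IV2 to 0 whenever IV1 is false.
df["house_size_filled"] = df["house_size"].fillna(0.0)

# DV ~ IV1 + IV2 + IV3 as a logistic regression.
fit = smf.logit("married ~ owns_house + house_size_filled + iv3", data=df).fit()
print(fit.summary())
```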

Suppose DV represents marriage status. If IV1 is true (i.e. the person owns a house), then the chances of that person being married increase with IV2 (the size of the house). So if IV1 is true and IV2 is near 0, it is extremely unlikely that the person is married. However, if IV1 is false (the person doesn't own a house), there is still a good chance that the person is married. But because I set IV2 to 0 in that case, my model keeps predicting that such a person isn't married.

So my question is, how can I better handle such a problem where a continuous variable exists only when a discrete variable takes a certain value?

Thanks. (and the house-marriage analogy is fictional of course!)

  • How have you determined that such an approach introduces bias? Shouldn't make any difference whether you're setting IV2 to zero or 999 or anything else - it'll be compensated for in the coefficient estimate for IV1. See here – Scortchi - Reinstate Monica Apr 08 '15 at 16:08

0 Answers