I am reviewing a paper in which the authors predict a DV by linear regression using, among other variables, dummy variables obtained from a tertile split of continuous variables that were not normally distributed. In other words, they took each non-normally distributed variable, split it into tertiles, created three dummy variables, one per tertile (I imagine they assigned 1 to subjects falling into the corresponding tertile), and entered all the dummy variables in the regression model. Their regression models reach a very high R^2 (.90). Is it correct to do so?
1 Answer
You can do this (put a continuous variable into bins), but it is generally considered a loss of information: every subject within a bin is treated as having the same value of the IV.
It would be appropriate if there is clearly a different effect when moving from one bin to another, or if the relationship between the IV and the DV appears to be stepwise: same effect from 0 to 5, different effect from 5 to 10, etc. You would want to look at a scatter plot of the DV against the IV being binned to see what the univariate relationship looks like. If the relationship looks linear, I would not understand why they chose to bin the variable.
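To make the loss of information concrete, here is a minimal sketch on simulated data (not the paper's data). The sample size, the lognormal predictor, the linear data-generating model, and the helper `r_squared` are all my own assumptions for illustration. It fits the same DV once with the continuous IV and once with tertile dummies, and compares R^2.

```python
import numpy as np

# Simulated data (an assumption for illustration, not the paper's data):
# a skewed, non-normal predictor and a truly linear relationship.
rng = np.random.default_rng(0)
n = 300
x = rng.lognormal(mean=0.0, sigma=0.5, size=n)
y = 2.0 * x + rng.normal(scale=1.0, size=n)


def r_squared(X, y):
    """R^2 of an OLS fit of y on X; X must already contain an intercept column."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)


# Model A: intercept + the continuous predictor.
X_cont = np.column_stack([np.ones(n), x])

# Model B: intercept + dummies for the 2nd and 3rd tertiles
# (the 1st tertile is the omitted reference category).
cuts = np.quantile(x, [1 / 3, 2 / 3])
tertile = np.digitize(x, cuts)  # 0, 1, 2 for the lower, middle, upper tertile
X_bin = np.column_stack([np.ones(n), tertile == 1, tertile == 2]).astype(float)

r2_cont = r_squared(X_cont, y)
r2_bin = r_squared(X_bin, y)
print(f"R^2, continuous IV:   {r2_cont:.3f}")
print(f"R^2, tertile dummies: {r2_bin:.3f}")
```

When the underlying relationship really is linear, as assumed here, the binned model should explain less variance even though it uses one more parameter. Note that the non-normality of the IV plays no role in this: linear regression makes no assumption about the distribution of the predictors.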
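Keep in mind that each dummy coefficient is interpreted relative to the single category left out of the model, which would probably be the first tertile. If the model has an intercept, only two of the three tertile dummies can be entered; entering all three makes them perfectly collinear with the intercept.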
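A small sketch of that interpretation, again on simulated data with made-up parameter values (my assumptions, not anything from the paper): with only tertile dummies in the model, the intercept is the mean of the DV in the reference tertile, and each dummy coefficient is the difference between that tertile's mean and the reference mean. The last lines show why all three dummies cannot be entered alongside an intercept.

```python
import numpy as np

# Simulated data (an assumption for illustration only).
rng = np.random.default_rng(1)
n = 300
x = rng.exponential(scale=1.0, size=n)  # skewed, non-normal predictor
y = 1.0 + 0.5 * x + rng.normal(scale=0.5, size=n)

# Tertile membership: 0 = lower, 1 = middle, 2 = upper.
cuts = np.quantile(x, [1 / 3, 2 / 3])
tertile = np.digitize(x, cuts)

# Intercept + dummies for the middle and upper tertiles;
# the lower tertile is the omitted reference category.
X = np.column_stack([np.ones(n), tertile == 1, tertile == 2]).astype(float)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Mean of the DV within each tertile, for comparison with the coefficients.
group_means = np.array([y[tertile == k].mean() for k in range(3)])
print("intercept:", beta[0], "| mean of lower tertile:", group_means[0])
print("middle dummy:", beta[1], "| middle minus lower:", group_means[1] - group_means[0])
print("upper dummy:", beta[2], "| upper minus lower:", group_means[2] - group_means[0])

# Entering all three dummies together with an intercept gives a
# rank-deficient design matrix: the three dummies sum to the intercept column.
X_full = np.column_stack(
    [np.ones(n), tertile == 0, tertile == 1, tertile == 2]
).astype(float)
rank_full = np.linalg.matrix_rank(X_full)
print("columns:", X_full.shape[1], "| rank:", rank_full)
```

So if the authors really entered three dummies per variable, it is worth asking whether the model had no intercept or whether the software silently dropped one dummy, since that changes what the reported coefficients mean.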
wcampbell
Did they have any better reason than that?
How many predictors were there, and what was the sample size?
– Glen_b Apr 16 '14 at 15:05