I am trying to learn which transformations improve a model, and I am comparing models that I have built. The first model is:
Call:
lm(formula = log(medv) ~ log(crim) + zn + log(indus) + chas +
log(nox) + log(rm) + log(age) + log(dis) + log(rad) + log(tax) +
log(ptratio) + log(black) + log(lstat), data = Boston)
Residuals:
Min 1Q Median 3Q Max
-0.95001 -0.10118 -0.00198 0.10961 0.82680
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.3504375 0.4336744 12.337 < 2e-16 ***
log(crim) -0.0314413 0.0111790 -2.813 0.005112 **
zn -0.0011481 0.0005828 -1.970 0.049410 *
log(indus) 0.0037637 0.0224508 0.168 0.866935
chas 0.1011952 0.0362298 2.793 0.005423 **
log(nox) -0.3659159 0.1074552 -3.405 0.000715 ***
log(rm) 0.3843709 0.1094673 3.511 0.000487 ***
log(age) 0.0410625 0.0223547 1.837 0.066833 .
log(dis) -0.1438053 0.0356083 -4.039 6.24e-05 ***
log(rad) 0.0949062 0.0220954 4.295 2.10e-05 ***
log(tax) -0.1759806 0.0477668 -3.684 0.000255 ***
log(ptratio) -0.5895440 0.0912645 -6.460 2.52e-10 ***
log(black) 0.0532854 0.0126549 4.211 3.03e-05 ***
log(lstat) -0.4186032 0.0258019 -16.224 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1988 on 492 degrees of freedom
Multiple R-squared: 0.7697, Adjusted R-squared: 0.7636
F-statistic: 126.5 on 13 and 492 DF, p-value: < 2.2e-16
The second model is:
Call:
lm(formula = medv ~ log(crim) + zn + log(indus) + chas + log(nox) +
log(rm) + log(age) + log(dis) + log(rad) + log(tax) + log(ptratio) +
log(black) + log(lstat), data = Boston)
Residuals:
Min 1Q Median 3Q Max
-13.3551 -2.5733 -0.2924 2.0704 22.8158
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.449e+01 9.307e+00 8.004 8.74e-15 ***
log(crim) 7.002e-02 2.399e-01 0.292 0.770524
zn -1.257e-04 1.251e-02 -0.010 0.991983
log(indus) -8.557e-01 4.818e-01 -1.776 0.076366 .
chas 2.480e+00 7.775e-01 3.190 0.001514 **
log(nox) -1.160e+01 2.306e+00 -5.030 6.90e-07 ***
log(rm) 1.374e+01 2.349e+00 5.850 8.98e-09 ***
log(age) 8.034e-01 4.798e-01 1.675 0.094658 .
log(dis) -6.327e+00 7.642e-01 -8.280 1.17e-15 ***
log(rad) 1.972e+00 4.742e-01 4.158 3.78e-05 ***
log(tax) -4.277e+00 1.025e+00 -4.172 3.57e-05 ***
log(ptratio) -1.357e+01 1.959e+00 -6.927 1.35e-11 ***
log(black) 1.005e+00 2.716e-01 3.701 0.000239 ***
log(lstat) -9.654e+00 5.537e-01 -17.433 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.266 on 492 degrees of freedom
Multiple R-squared: 0.7904, Adjusted R-squared: 0.7849
F-statistic: 142.7 on 13 and 492 DF, p-value: < 2.2e-16
The only difference between the models is the log transformation of the dependent variable. When I compare them, I see that the residual standard error is much higher in the second model, but the R-squared is also higher in the second model. I do not understand which model is better. Is the large reduction in the residual standard error in the first model due to the log transformation of the dependent variable, or not?
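One way to make the two models comparable is to put both sets of fitted values back on the original medv scale and compute an error measure there. The sketch below assumes the two formulas from the question; note that the naive `exp()` back-transformation is biased (a correction such as Duan's smearing estimator would be more careful), so this is only a rough comparison, not a definitive procedure:

```r
library(MASS)  # provides the Boston data set

fit_log <- lm(log(medv) ~ log(crim) + zn + log(indus) + chas + log(nox) +
                log(rm) + log(age) + log(dis) + log(rad) + log(tax) +
                log(ptratio) + log(black) + log(lstat), data = Boston)
fit_raw <- update(fit_log, medv ~ .)  # same predictors, untransformed response

# Back-transform the log model's fitted values to the medv scale.
# exp() alone underestimates the conditional mean; this is a rough check only.
pred_log <- exp(fitted(fit_log))
pred_raw <- fitted(fit_raw)

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
rmse(Boston$medv, pred_log)
rmse(Boston$medv, pred_raw)
```

Because both RMSE values are now in the units of medv (thousands of dollars), they can be compared directly, unlike the residual standard errors printed in the two summaries.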
On the medv scale, the first model is multiplicative and the second one is additive. These models are not even similar. Since the residual standard errors are not on the same scale, you can't compare them. Also, due to the large number of predictors you are probably overfitting and should test for multicollinearity. – Roland Sep 17 '15 at 11:40
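The multicollinearity check suggested in the comment can be done with variance inflation factors. A minimal sketch, assuming the `car` package is installed and `fit_log` is the first model from the question (a common rule of thumb flags VIFs above 5 or 10):

```r
library(car)   # provides vif()

fit_log <- lm(log(medv) ~ log(crim) + zn + log(indus) + chas + log(nox) +
                log(rm) + log(age) + log(dis) + log(rad) + log(tax) +
                log(ptratio) + log(black) + log(lstat), data = Boston)

# Variance inflation factors; large values indicate predictors that are
# nearly linear combinations of the others
vif(fit_log)
```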