I want to select models using regsubsets(). I have a data frame called olympiadaten (data uploaded: http://www.sendspace.com/file/8e27d0). I first attach this data frame and then run the analysis. My code is:
attach(olympiadaten)
library(leaps)
a<-regsubsets(Gesamt ~ CommunistSocialist + CountrySize + GNI + Lifeexp +
Schoolyears + ExpMilitary + Mortality +
PopPoverty + PopTotal + ExpEdu + ExpHealth, data=olympiadaten, nbest=2)
summary(a)
plot(a,scale="adjr2")
summary(lm(Gesamt~ExpHealth))
[Screenshot of the plot produced by plot(a, scale="adjr2")]
The problem is that when I refit the best model "manually" to have a look at it, the adjusted R squared is not the same as in the regsubsets output. This is also the case for the other models, e.g. when I fit the simplest model in the graphic:
summary(lm(Gesamt~ExpHealth))
The graphic says it should have an adjusted R squared of about 0.14, but in the output I get a value of 0.06435.
Here is the output of summary(lm(Gesamt~ExpHealth)):
Call:
lm(formula = Gesamt ~ ExpHealth)
Residuals:
    Min      1Q  Median      3Q     Max
-18.686  -9.856  -4.496   1.434  81.980

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -3.0681     6.1683  -0.497   0.6203
ExpHealth     1.9903     0.7805   2.550   0.0127 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 18.71 on 79 degrees of freedom
(4 observations deleted due to missingness)
Multiple R-squared: 0.07605, Adjusted R-squared: 0.06435
F-statistic: 6.502 on 1 and 79 DF, p-value: 0.01271
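As a sanity check on the output above, the adjusted R squared can be recomputed by hand. This is only a sketch, assuming the usual definition with $n$ observations and $p$ predictors (intercept not counted); $n = 81$ is inferred from the 79 residual degrees of freedom plus the two estimated coefficients, and $p = 1$.

$$
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
          = 1 - (1 - 0.07605)\cdot\frac{80}{79}
          \approx 0.0644
$$

This agrees with the reported 0.06435, so the lm() figure is internally consistent for the 81 rows that lm() actually used.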
I don't know what I might have done wrong, any help would be appreciated.
And last but not least, some more questions:
- What is the difference between selecting models by AIC and by the adj. R squared?
- Both measure the fit and account for the number of variables, so isn't the best model chosen by AIC also the model with the highest adj. R squared?
- When I have 12 variables, this means there are $2^{12}$ possible models, right?
- So does the regsubsets() command calculate each model and show the two best (nbest=2) of each size?
- If so, do I really get the 'best' model?
- And when I do AIC using backwards selection (starting with the model which contains all variables), does this also end up with the same model that regsubsets() says is the best?
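To make the quantities in these questions concrete, here is a rough sketch of the definitions involved. It assumes an ordinary linear model with $n$ observations, $p$ predictors (intercept not counted), residual sum of squares $\mathrm{RSS}_p$ and total sum of squares $\mathrm{TSS}$; the AIC is written in its Gaussian linear-model form, up to an additive constant.

$$
\#\{\text{candidate models}\} = \sum_{k=0}^{12} \binom{12}{k} = 2^{12} = 4096
$$

$$
\bar{R}^2 = 1 - \frac{\mathrm{RSS}_p/(n - p - 1)}{\mathrm{TSS}/(n - 1)},
\qquad
\mathrm{AIC} = n \log\!\left(\frac{\mathrm{RSS}_p}{n}\right) + 2p + \text{const}
$$

The count of 4096 includes the intercept-only model. For a fixed $p$, both criteria depend on the model only through $\mathrm{RSS}_p$, so they rank models of the same size identically. Across different sizes the penalties differ, so under these definitions the two criteria need not select the same model.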
but the problem is, now I get an adj. R squared of 0.009202, which is still not correct (and even worse)? – user1690846 Sep 26 '12 at 11:26
[...] regsubsets, so you've removed too many rows. And anyway, I don't see why you would want to replicate the adjusted $R^2$ that regsubsets gives, unless it's just to understand how it was obtained. – mark999 Sep 26 '12 at 11:34
summary(lm(Gesamt ~ ExpHealth, data = subset(olympiadaten, !is.na(CommunistSocialist) & !is.na(CountrySize) & !is.na(GNI) & !is.na(Lifeexp) & !is.na(Schoolyears) & !is.na(ExpMilitary) & !is.na(Mortality) & !is.na(PopPoverty) & !is.na(PopTotal) & !is.na(ExpEdu) & !is.na(ExpHealth)))) – mark999 Sep 26 '12 at 11:38
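The idea in the last comment (refit on only those rows that are complete for every variable in the full formula) can be written more compactly with complete.cases(). The following is a minimal sketch, not the original analysis: since olympiadaten is not reproduced here, the built-in airquality data set stands in for it, and the names dat, vars, full_rows, fit_default and fit_cc are made up for illustration.

```r
# Stand-in data (assumption): airquality has NAs in Ozone and Solar.R
# and plays the role of olympiadaten here.
dat  <- airquality
vars <- c("Ozone", "Solar.R", "Wind", "Temp")  # response plus all candidate predictors

# TRUE for rows with no NA in any variable of the full formula.
# Per the comments above, this is the row set to match regsubsets().
full_rows <- complete.cases(dat[, vars])

# Default lm(): only rows with NA in Ozone or Wind are dropped.
fit_default <- lm(Ozone ~ Wind, data = dat)

# Same one-predictor model, restricted to the complete-case rows.
fit_cc <- lm(Ozone ~ Wind, data = dat[full_rows, ])

# Compare the number of rows used and the adjusted R squared.
n_default     <- nobs(fit_default)
n_cc          <- nobs(fit_cc)
adjr2_default <- summary(fit_default)$adj.r.squared
adjr2_cc      <- summary(fit_cc)$adj.r.squared

cat("rows used:", n_default, "vs", n_cc, "\n")
cat("adjusted R squared:", adjr2_default, "vs", adjr2_cc, "\n")
```

With the original data, vars would presumably be Gesamt plus the eleven predictors from the regsubsets() call, and passing data = olympiadaten[full_rows, ] to lm() would avoid relying on attach().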