Selecting most important variable based on individual p-value vs. partial $R^2$

Question

I'm trying to solve a problem where the goal is to find an association between children's cortisol values (y) against their mother's weekly cortisol averages (x1 to x6) and gender (z). After conducting model selection strategies in R (all subset regression, backward elimination, etc.), the following two 'optimal' models were found:

y = a0 + a5*x5

and

y = a0 + a1*x1 + a3*x3 + a4*x4 + a5*x5 + a6*x6

For the second model, I found something interesting:

x3: p-value = 0.16, partial $R^2$ = 37.1%
x5: p-value = 0.04, partial $R^2$ = 5.5%

(Let's ignore other variables for now – their p-values and partial $R^2$ values were between these two variables.)

[Note: p-value refers to the probability / significance of the variable not being equal to zero; partial $R^2$ to the percent of variation that cannot be explained in a model that doesn't contain the variable]

Now to my question: Why do I see x5 being much more significant to the model than x3, yet dropping x3 from the model will drop my $R^2$ values a lot (from around 20% to about 5%), but not so much for x5? Is the reason collinearity between all the variables in the model (which does exist)? Or is it something else?

Also, the ultimate goal is to find the most important variable describing the response. Would I choose x3 or x5 in this case, and why? Or can such a choice be made?

Your dilemma is false. Partial correlation is just another way to standardize regression coefficient b, and they share the same p-value. http://stats.stackexchange.com/a/76819/3277. — ttnphns, Mar 23 '14 at 06:18

score 3 · Answer 1 · answered Mar 23 '14 at 03:31

You leave out a lot of details, you mention stepwise regression and other things and come up with 2 models, but you don't tell us which models your partial R-squareds and other things come from.

One thing that you are seeing is some of the problems that come from automated model selection procedures. Think about the assumptions that go into calculating p-values and the other statistics that you computed. It turns out that p-values computed after stepwise regressions majorly violate the assumptions and quickly become meaningless. A lot of statisticians are also moving away from trusting R-squared values, partial R-squared would have even more problems (and would vary widely depending on what other terms are being adjusted for).

What is more important than comparing p-values is to work out what question or questions you really want to answer. Some questions are best answered using the full model without selection, others by comparing specific nested models rather than finding a dubious "Best" model. If you really want to simplify a model the expert opinion these days is leaning much more towards penalized methods rather than stepwise selection. Look at ridge regression, lasso, and elasticnet methods.

You might also want to simulate data where you know the "Truth", then run the various methods on the simulated data to see how the estimates compare to the generating truth. This can help understanding of what is going on, what works well, and the limitation of methods that are supported more by tradition and inertia than by being the "best".

Sorry, I tried to keep it concise without giving too much detail.. The p-values and partial R-squared values come from the second model (with five variables). But thank you for the suggestions; I'll look into them for sure (ridge regression and so on). But why would partial R-squared have more problems than R-squared? Because of the other potential adjustments that are made? — JCB, Mar 23 '14 at 21:47

score 2 · Answer 2 · answered Mar 23 '14 at 12:35

As Greg and others have said, variable selection creates a host of serious problems. One way to view this is that you are asking more of the data than the data are capable of telling. There is an analogy between data torture and human torture. With torture, the data will tell you anything you want to hear, true or not. This is especially true with the predictors are correlated. To me, the best way to expose the problem is to bootstrap the ranks of the predictors in order to get confidence intervals for the ranks. Variables can be ranked on a number of metrics, including pairwise and partial $R^2$. See Section 5.4 of Handouts under http://biostat.mc.vanderbilt.edu/CourseBios330 . You will see on most of the datasets you encounter that importance of many of the predictors is highly uncertain - the data may only be able to tell you that the apparent winning predictor is not among the 5 worst predictors.

score 1 · Answer 3 · answered Mar 24 '14 at 13:33

Stepping back a bit from the mechanistic workings of regression, and looking at the 'information economics' here, I see several missed opportunities in your approach to the data, JCB. Most importantly, the models you discuss seem to ignore the longitudinal structure of the data. (The variables x1..x6 are sequenced in time and grouped by mother.) Also, your use of averages makes me hope that maybe you had daily measures, but arbitrarily grouped and averaged them week-wise as a variable reduction approach. If by any chance you actually have not 6 but 7*6 cortisol measures per mother, then you have an opportunity to learn from some fairly rich time-series data spanning the menstrual cycle, and to formulate and test some pointed scientific hypotheses about causal mechanisms. What is the scientific question that motivates your search for an association?

Selecting most important variable based on individual p-value vs. partial $R^2$

3 Answers3