2

I have data on about 8000 persons and am trying to find independent predictors of a health outcome variable (yvar). The predictors are age, gender, height, city, and three other variables (xvar1, xvar2, xvar3). Some are continuous while others are categorical. The categorical variables are kept as such and are not converted to numbers (e.g. 'M' and 'F' are the levels of gender). The outcome variable (yvar) is continuous.

If I use the following code in R (main effects only, no interactions):

lm(yvar~age+gender+height+city+xvar1+xvar2+xvar3)

I find that 5 of these 7 predictors have $p<0.05$ (many much smaller than 0.05), with an overall $R^2$ of 0.11.

On using the following code (all interactions, up to 7-way):

lm(yvar~age*gender*height*city*xvar1*xvar2*xvar3)

I get an $R^2$ of 0.18, but NONE of the predictors has $p<0.05$.

What do I conclude from this? Should I or should I not use interactions? What is the best way to analyze such data?

Also, should I use one of the above formats or the following format:

lm(yvar~(age+gender+height+city+xvar1+xvar2+xvar3)*(age+gender+height+city+xvar1+xvar2+xvar3) )

This format produces only the two-way interactions, not all higher-order interaction combinations as in the second format.

rnso
  • 10,009
  • Terminology tip: Using the word "parameter" to mean just an index, measure, property, or variable clashes with its established statistical use as an unknown constant that you are estimating. – Nick Cox Dec 09 '14 at 12:40
  • I changed from 'parameter' to 'outcome variable' in the question. – rnso Dec 09 '14 at 17:35

2 Answers

8

The approach you are using is devoid of input from subject-matter knowledge, which is usually a recipe for trouble. Using $P$-values to guide model specification is fraught with statistical problems. And you may be using the wrong statistical tests as the basis for your concern, i.e., you may be trying to interpret a main effect in the presence of an interaction effect. The appropriate tests are combined tests of main effects + interaction effects. More importantly, having 7th-order interactions in your model is huge overkill, resulting in the estimation of far too many parameters and making everything unstable. It would be better to stick to second-order interactions (in R, (a + b + c)^2), though even this approach is a bit dangerous; and if you do not include all the original variables as main effects, you will get a false impression of the importance of the interactions.
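
As a rough illustration of this advice (a sketch, not Frank Harrell's own code), a second-order model with the question's variables and a pooled test of all terms involving one variable could look like the following; dat is a hypothetical data frame holding the variables:

# Sketch only: all main effects plus all two-way interactions
# (dat is a hypothetical data frame containing yvar and the predictors).
fit2 <- lm(yvar ~ (age + gender + height + city + xvar1 + xvar2 + xvar3)^2,
           data = dat)

# Combined test of main effect + interactions for one variable (here age):
# refit without any term involving age, then compare the two fits with an F test.
fit2_noage <- update(fit2, . ~ . - age
                     - age:(gender + height + city + xvar1 + xvar2 + xvar3))
anova(fit2_noage, fit2)

# The rms package (ols() followed by anova()) reports such pooled
# "main effect + higher-order" tests automatically.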

Try to think of a statistical analysis not as a fishing expedition but as an exercise in reasoned model specification.

Nick Cox
  • 56,404
  • 8
  • 127
  • 185
Frank Harrell
  • 91,879
  • 6
  • 178
  • 397
  • 2
    I would add that such high-order interactions would be impossible to interpret and would have a high potential for overfitting. – Tim Dec 09 '14 at 12:42
2

If I understand you correctly, you have a number of explanatory variables but not much clue which of them could be relevant and which not. Here are a few ways to proceed:

  1. Try out all possible subsets of variables and pick the one that gives the regression with the smallest Bayesian information criterion (BIC) value. See e.g. here for relevant R functions. If you want to allow for interactions, too, then either define new regressors by multiplying the existing ones in a pairwise fashion or look for an existing function to do that for you. (A brief sketch follows after this list.)
    Using BIC will help strike the right balance between possible overfitting and underfitting. If you intend to use your model for forecasting rather than explanation, use the Akaike information criterion (AIC) instead of BIC.

  2. Forward or backward stepwise selection: start from a small model and add regressors one by one based on their relevance (broadly speaking) or start from a general model and remove regressors one by one, again based on their relevance.

  3. Shrinkage and related methods (LASSO, ridge regression, elastic net, principal components regression, partial least squares): if you want to reduce the mean squared error of your model and do not care exclusively about the unbiasedness of your estimates, you might want to accept some bias in exchange for a decrease in variance. This makes sense if you intend to forecast but not so much if your study is explanatory. (A brief LASSO sketch is also included below.)
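
As a minimal sketch of approaches 1 and 3, assuming yvar and the predictors sit in a data frame (called dat here purely for illustration), one could use the leaps and glmnet packages; these are common choices but not the only options:

library(leaps)    # provides regsubsets() for best-subsets search
library(glmnet)   # provides cv.glmnet() for the LASSO / elastic net

# Approach 1: best subsets scored by BIC.
# Factors (gender, city) are expanded into dummy columns automatically.
subs <- regsubsets(yvar ~ age + gender + height + city + xvar1 + xvar2 + xvar3,
                   data = dat, nvmax = 15)
summary(subs)$bic              # BIC for the best model of each size
which.min(summary(subs)$bic)   # size of the overall BIC-best model

# (Approach 2, stepwise selection, is available via step() in base R,
#  but see the comments below for why it is widely discouraged.)

# Approach 3: LASSO.
X  <- model.matrix(yvar ~ age + gender + height + city + xvar1 + xvar2 + xvar3,
                   data = dat)[, -1]       # design matrix without the intercept
cv <- cv.glmnet(X, dat$yvar, alpha = 1)    # alpha = 1 selects the LASSO penalty
coef(cv, s = "lambda.1se")                 # variables retained at a conservative lambda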

See Hastie et al., "The Elements of Statistical Learning", Chapter 3, Sections 3.3-3.7, for a more detailed overview.

To address your concern about p-values: they are not sacred and many say they are given too much importance. Sometimes (when there is a lot of data) even irrelevant variables become statistically significant, but the magnitudes of their coefficients are small and substantively negligible. Sometimes two or more variables are jointly significant but not so individually. Thus you have to interpret them carefully and not just mechanically.
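
To illustrate the joint-versus-individual significance point, one can compare nested models with an F test; the sketch below again assumes a hypothetical data frame dat and picks xvar1 and xvar2 purely as an example:

full    <- lm(yvar ~ age + gender + height + city + xvar1 + xvar2 + xvar3, data = dat)
reduced <- lm(yvar ~ age + gender + height + city + xvar3, data = dat)
anova(reduced, full)   # joint F test of xvar1 and xvar2; they may be significant
                       # together even when neither individual t test is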

Nick Cox
  • 56,404
  • 8
  • 127
  • 185
Richard Hardy
  • 67,272
  • 1
    Note that some of these are more controversial than others. #2 is widely considered a very bad idea, for reasons well explained by Frank Harrell, a contributor here, in his book Regression Modeling Strategies (New York: Springer, 2001). I'd want to add #0: Consult subject-matter experts if you lack the subject-matter expertise to consider what kinds of relationships are likely to be of interest or importance. – Nick Cox Dec 09 '14 at 17:43
  • 1
    #1 is also a very bad idea. – Frank Harrell Dec 09 '14 at 21:05
  • I tried bestglm and found it very easy to use. The simple command bestglm(mydata) uses default options and gives me the best model. Is it fair to use it in most situations? – rnso Dec 10 '14 at 11:41
  • 1
    No. It is important to fully understand the assumptions made and the performance characteristics of any method for model development. – Frank Harrell Dec 10 '14 at 11:52
  • I checked and found that in most studies interactions are not used at all. Only the individual independent variables are entered in the multiple regression using '+' (and not '*'), and the significant ones are reported. Is this safe to follow when interactions are not obvious? – rnso Dec 10 '14 at 12:23
  • This is where you need some subject-matter knowledge. The statistical learning / machine learning techniques I proposed are only tools; they can be useful, but they have to be used wisely, depending on the context. I am no expert in the area of health modelling, so I cannot advise you on this question. – Richard Hardy Dec 10 '14 at 13:58
  • Using multiple regression analysis with only '+', I am getting age, gender, height and two more variables as significant. On the other hand, if I use bestglm(mydata), only age and gender appear in the best model. What do I conclude? – rnso Dec 10 '14 at 15:26
  • Parameter significance as assessed by p-values is not the best indicator of which variables to include. This has been claimed multiple times, see e.g. here. Better to look at BIC values for the different models to see which one is more relevant. – Richard Hardy Dec 10 '14 at 15:31
  • The command bestglm(mydata) gives me Df, Sum Sq, Mean Sq, F value and Pr(>F) of the best model only. How do I get 'BIC values for different models'? – rnso Dec 10 '14 at 16:56
  • You should be able to calculate BIC manually given the Sum Sq and the number of parameters estimated in the model (you could use the total number of observations minus the Df to obtain those). Hint: take a look at the BIC formula and find out how the log-likelihood can be expressed as a function of the Sum Sq (this holds for normally distributed errors; I am not sure whether it holds universally). – Richard Hardy Dec 10 '14 at 19:53
  • @FrankHarrell Why is #1 a very bad idea? Is it because of (i) trying out all possible subsets of variables, or (ii) using the BIC as the measure of relevancy? – ManUtdBloke Oct 14 '20 at 10:01
  • 1
    This has been studied extensively and shown to fail. See for example my RMS course notes. Among the problems: overfitting/poor validation, ruined standard errors, bias, instability, ruined confidence interval coverage. – Frank Harrell Oct 14 '20 at 12:16
  • @FrankHarrell So how do we decide what variables/interactions to use in the model if both 1) and 2) are bad? Subject matter knowledge is not going to help in the very common situation in which we have no pre-existing knowledge of whether a particular interaction effect is present in nature and we are relying on the information in the data itself. Suppose we have the predictors - age, gender, height, weight, experience, IQ - and the response variable salary. How do we decide what interaction effects to include/not include? – ManUtdBloke Oct 21 '20 at 14:25
  • I have turned the above comment into a post as I have not found good info on this anywhere so I think it could be useful to the community in general - https://stats.stackexchange.com/questions/493027/how-to-systematically-choose-which-interactions-to-include-in-a-multiple-regress – ManUtdBloke Oct 21 '20 at 14:42
  • 2
    It's a tough problem. Throughout my career I've never found exploratory interaction searching to be close to reliable. So it remains necessary to narrow potential interactions to include in the model. Ask the question "could we explain this interaction were it to be found important?" – Frank Harrell Oct 21 '20 at 14:52
  • Thanks, it's good to know that somehow narrowing the interactions is fundamentally important. I will keep that question in mind any time I run into this type of situation in the future. – ManUtdBloke Oct 22 '20 at 10:40