
I have fitted my data to a linear model in R using generalized least squares, as follows:

library(nlme)  # gls() is from the nlme package
mod <- gls(Y ~ A * B + Z, data = dat)

and used weights due to heteroscedasticity as follows:

mod_varpower <- update(mod, weights = varPower())

and then I ran an ANOVA (first question here: let's say A and B are factors and Z is a continuous variable; in that case I have basically done an ANCOVA, right?):

library(car)
Anova(mod_varpower)  # Analysis of variance

which gives me the following output:

> Anova(mod_varpower)  # Analysis of variance
Analysis of Deviance Table (Type II tests)

Response: Y
    Df  Chisq Pr(>Chisq)  
A    2 4.9937    0.08234 .
B    2 7.3418    0.02545 *
Z    1 6.3873    0.01149 *
A:B  4 7.9339    0.09403 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

As we can see here, I don't have a significant interaction between A and B, so according to Engqvist (2005) I should drop the interaction effect. However, now to my question: if I do look into contrasts of contrasts, I get the following:

library(emmeans)
emmip(mod_varpower, A ~ B)  # look at interactions

emm_s.t <- emmeans(mod_varpower, ~ A * B)  # look at contrasts

> contrast(emm_s.t, interaction = "pairwise")
 A_pairwise          B_pairwise          estimate   SE  df t.ratio p.value
 Metabolite - Normal Metabolite - Normal   -0.171 1.63 140  -0.104  0.9170
 Metabolite - Low    Metabolite - Normal   -2.036 1.42 172  -1.430  0.1545
 Cotton - Low        Metabolite - Normal   -1.865 1.43 186  -1.305  0.1934
 Metabolite - Normal Metabolite - Low      -2.448 1.49 138  -1.639  0.1035
 Metabolite - Low    Metabolite - Low      -3.246 1.33 185  -2.441  0.0156
 Cotton - Low        Metabolite - Low      -0.798 1.39 183  -0.574  0.5666
 Metabolite - Normal Cotton - Low          -2.277 1.47 183  -1.550  0.1229
 Metabolite - Low    Cotton - Low          -1.211 1.37 156  -0.881  0.3796
 Cotton - Low        Cotton - Low           1.066 1.45 167   0.733  0.4646

Degrees-of-freedom method: satterthwaite

where the 5th row shows a p-value below 0.05, which means we have a significant interaction contrast, if I'm not mistaken?

So on what basis can we remove the interaction term, as proposed by Engqvist (2005)? Should I go by the ANOVA output or by the contrasts of contrasts?

EDIT

If I look at the correlation, the two are highly correlated and, to my understanding, linearly related. I must mention that stats is not my strong side, so some of my questions might be a bit obvious.

> Mod <- cor.test(dat$Z, dat$Y, method = "spearman", use = "complete.obs")
> Mod

	Spearman's rank correlation rho

data:  dat$Z and dat$Y
S = 1131356, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.6047263 

[scatter plot of Z against Y]

Blanca

1 Answer


It's risky to make decisions based on the arbitrary p < 0.05 test for "significance." Also, once you start doing that, you get into apparent inconsistencies like what you show.

It's certainly possible for one specific contrast to pass that test even if an overall evaluation doesn't. It's also possible to have an overall test indicate that there are "significant" differences among scenarios while no particular pairwise comparisons are "significant." That has much less to do with the underlying reality and more to do with the different numbers of degrees of freedom used in overall versus specific tests, multiple-comparison corrections, and the arbitrary p-value cutoff.
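As a concrete illustration (a minimal sketch, assuming the mod_varpower object from your question), the emmeans package can compute the overall joint test of each term alongside the individual interaction contrasts, so you can see both kinds of test computed from the same fit:

library(emmeans)
# joint (overall) F tests for each model term; the analogue of the
# Type II chi-square tests shown in the question
joint_tests(mod_varpower)
# individual interaction contrasts; each uses fewer degrees of
# freedom and tests a much narrower hypothesis
contrast(emmeans(mod_varpower, ~ A * B), interaction = "pairwise")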

The decision has mostly to do with what you are trying to accomplish with the model. For a predictive model you typically want to include as many terms as reasonable without overfitting. If that's your purpose in modeling, you would probably keep an interaction with a p-value of 0.09 (which isn't that far from "significant").

If you are trying to produce a simple model that adequately describes your results, then you might remove it. For example, Frank Harrell suggests pre-specifying interactions when starting to develop a model, but then removing or keeping all interactions based on a global test. See Section 4.12.1 of his Regression Modeling Strategies, with an example of such a choice in Section 21.3.
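For a gls fit, one way to implement such a global test of the interaction is a likelihood-ratio comparison of models with and without the A:B term. A sketch, assuming the objects from your question; note that gls models must be refit with method = "ML" before their fixed effects can be compared this way:

# refit with maximum likelihood, since REML fits are not comparable
# across different fixed-effects structures
mod_full  <- update(mod_varpower, method = "ML")
# drop the interaction only (A, B and Z stay in the model)
mod_noint <- update(mod_full, model. = . ~ . - A:B)
anova(mod_noint, mod_full)  # global likelihood-ratio test of A:B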

You have modeled Z as linearly related to outcome. That's a pretty strong assumption and might be affecting the apparent "significance" of your other terms if the true relationship isn't linear. It can help to model continuous predictors flexibly, for example with regression splines. That might be the ultimate solution to your problem.
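A sketch of that suggestion, assuming the data frame and model from your question (the df = 3 for the spline is an arbitrary starting choice, not a recommendation):

library(nlme)
library(splines)
# model Z with a natural cubic spline instead of a straight line;
# fit both versions with ML so they can be compared
mod_line   <- gls(Y ~ A * B + Z, data = dat,
                  weights = varPower(), method = "ML")
mod_spline <- gls(Y ~ A * B + ns(Z, df = 3), data = dat,
                  weights = varPower(), method = "ML")
anova(mod_line, mod_spline)  # does the extra flexibility help?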

EdM
  • Exactly! Data analysis is not an algorithm where you can follow some flow chart that says "if this, do this". You have to think.

    A surprising number of people object to this.

    – Peter Flom Jul 08 '23 at 13:26
  • As for what I'm trying to do: I'm trying to produce a simple model that describes my results.

    As for Z: I've added an edit with a correlation plot for Z and the rho value. As you mentioned in the last paragraph, Z being linearly related to Y is a strong assumption, but I have the correlation plot and literature showing that Z has a strong positive influence on Y.

    However, A and B could also influence Z, so I want to make sure that what I see influencing Y from A and B is actually due to A and B and not to Z. I don't know if this makes sense to you.

    – Blanca Jul 08 '23 at 18:30
  • @Blanca what you are trying to do makes sense. You clearly have to adjust for Z in the model. My suggestion is to try something more flexible than a simple linear adjustment for it. The Spearman correlation test you report is non-parametric, only evaluating the ranks of observations. It doesn't indicate linearity. See this page and its links. Even if you did a Pearson test for linear correlation, a curvilinear association of Z with outcome could still have a strong Pearson correlation coefficient. – EdM Jul 08 '23 at 21:33
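To illustrate EdM's last point with made-up numbers: a clearly curved relationship can still produce a high Pearson (and Spearman) correlation, so a large coefficient alone does not establish linearity.

# simulated example (hypothetical data): y is a curved function of z
set.seed(42)
z <- runif(200, min = 0, max = 10)
y <- sqrt(z) + rnorm(200, sd = 0.2)
cor(z, y, method = "pearson")   # high despite the curvature
cor(z, y, method = "spearman")  # also high: monotone, not linear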