5

I just ran a non metric multidimensional scaling model (nmds) which compared multiple locations based on benthic invertebrate species composition. After running the analysis, I used the vector fitting technique to see how the resulting ordination would relate to some environmental variables.

My first question is: I got an R squared value of .18 for a variable representing "depth" at a site. It has the highest R square value of any environmental variable by far (the rest are less than 0.07) and is the only variable which is significant at the alpha=0.05 level. I'm wondering if there are any thresholds for how much variance explained, is a good amount. I know it might depend on the scientific discipline, I've heard things like 30% for ecology, but I can't track down a reference for this.

My second question is, am I right in saying that for this case the % of variance explained, is the % of variance in the depth data explained by the nmds ordination axes? Thanks!

Jimj
  • 1,183
  • 1
  • 14
  • 25

1 Answers1

8

I wouldn't place much stock in "rules of thumb" such as this. It is dependent upon so many things such as the number of variables, the number of sites, what dissimilarity you use etc. Also note that the vector fitting approach is inherently linear and we have no reason to presume that the relationship between the variable and the NMDS configuration is linear.

The key thing is that it is a small/modest correlation but that it is significant. But you probably want to look at the linearity assumption.

In the vegan package for R we have function ordisurf() for this. It fits a 2-d surface to an NMDS solution using a GAM via function gam() in package mgcv. It essentially fits the model

$$g(\mu_i) = \eta_i = \beta_0 + f(nmds_{1i}, nmds_{2i})$$

where $f$ is a 2-d smooth function of the 1st and 2nd axes of the NMDS, $g$ is the link function, $\mu$ the expectation of the response, and $\eta$ is the linear predictor. The error distribution is a member of the exponential family of distributions. The function $f$ can be isotropic in which case we use smooths formed by s() employing by default thin plate splines. Anisotropic surfaces can be fitted too, where we use smooths formed by te(); tensor product smooths.

The complexity of the smooth is chosen during fitting using a, by default in the development versions, REML criterion. GCV smoothness selection is also available.

Here is an R example using one of the in-built data sets provided with vegan

require("vegan")
data(dune)
data(dune.env)

## fit NMDS using Bray-Curtis dissimilarity (default)
set.seed(12)
sol <- metaMDS(dune)

## NMDS plot
plot(sol)
## Fit and add the 2d surface
sol.s <- ordisurf(sol ~ A1, data = dune.env, method = "REML", 
                  select = TRUE)
## look at the fitted model
summary(sol.s)

This produces

> summary(sol.s)

Family: gaussian 
Link function: identity 

Formula:
y ~ s(x1, x2, k = knots[1], bs = bs[1])
<environment: 0x2fb78a0>

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   4.8500     0.4105   11.81 9.65e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
           edf Ref.df     F p-value  
s(x1,x2) 1.591      9 0.863  0.0203 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =   0.29   Deviance explained =   35%
REML score = 41.587  Scale est. = 3.3706    n = 20

and

enter image description here

In this case a linear vector fit seems reasonable for this variable. Read ?ordisurf for details on the arguments used, especially what select = TRUE does.

Gavin Simpson
  • 47,626
  • what if the variables turn out not to be linear? What are the next steps? (1. move forward with caution, 2. drop from analysis, 3. transform, 4. other?...) – theforestecologist Jan 10 '23 at 22:51
  • 1
    @theforestecologist if it isn't linear, then you can just use the GAM fit returned by ordisurf(), which is an object of class "gam" as fitted by the gam() function in package mgcv. The summary() method for those objects returns information on the statistical; significance of the estimated surface, so I would report that in place of the output from envfit() for the non-linear variables. If you want to move on to fit a constrained ordination using a non-linear effect of the covariate on the multivariate response, that becomes more difficult but could be done with splines also – Gavin Simpson Jan 11 '23 at 09:58
  • Thanks. Two follow-up questions: (1) is the R2 from summary.ordisurf comparable to the R2 from envfit? -- in other words, can I place them in the same table column showing strength of correlation for both linear and non-linear variables together? (2) is there a quick, non-visual way to determine if a relationship is non-linear without needing to plot and observe all variable contours of interest? (e.g., something from the summary output that might be used as a codable quantitative measure? ) – theforestecologist Jan 17 '23 at 21:11
  • @theforestecologist 1) the R2 is not directly comparable because the one in the GAM is an adjusted R2, accounting for the complexity of the model fit. 2) the EDF column (edf) would do that; without select = TRUE, a linear surface would have 2 effective degrees of freedom (EDF), where the EDF > 2 the surface is non-linear. When select = TRUE this is not true however and it makes the EDF less easily interpretable in the manner you would like. If the surface is non-linear, you should be able to see it in the contour plot, too – Gavin Simpson Jan 17 '23 at 22:42
  • is there a way to convert the adjusted R2 to be comparable to the envfit R2? – theforestecologist Jan 17 '23 at 22:47
  • perhaps a difficult subjective question to address: but what would be considered too nonlinear to instead use envfit? Graphically speaking this is quite difficult to describe I assume, but what about regarding the edf approach as explained in your last comment? Do you think a hard cutoff at edf=2 is appropriate, or can an edf quite a bit higher be potentially "linear enough" to indicate vector fitting with envfit to instead be appropriate? For example this plot seems fairly linear overall, but it has an edf of 7.387. Thoughts? (thanks!) – theforestecologist Jan 17 '23 at 22:59
  • and final question: envfit conveniently provides scores for each ordination axis, while ordisurf instead provides summary output for the gam model overall. Is there a way to extract and or translate the summary.ordisurf output into the context of individual scores for each ordination axis separately? – theforestecologist Jan 17 '23 at 23:08
  • To convert the envfit R2 into an adjusted R2, just look up the equation for it (you'll need the number of observations, and the number of predictors (2 in this case)). – Gavin Simpson Jan 18 '23 at 11:54
  • That surface you linked to is certainly non-linear - if you plotted it with vis.gam() you'd see that. There is a formal way to test if you need non-linearity but I haven't thought about how you'd actually do this with ordisurf() - if you are interested, ask a new Question on this site and link to it here and if you can provide an example (like the one in my post I will show you how to do it by hand using mgcv directly) – Gavin Simpson Jan 18 '23 at 11:57
  • As for scores, you mean the estimated coefficients which give rise to a vector. The equivalent is to predict the value of the response at a grid of values over the range of your covariates, and then plot those predicted values. That would likely be best as a Question on [so] rather than here, but if you ask one there, link to it here and I'll answer it – Gavin Simpson Jan 18 '23 at 11:59
  • thanks. new linked posts: (1) https://stats.stackexchange.com/questions/602401/formal-way-to-test-if-a-non-linear-approach-is-necessary-to-correlating-environm and (2) https://stackoverflow.com/questions/75162757/is-there-a-way-to-extract-and-or-translate-summary-ordisurf-output-to-view-the-r – theforestecologist Jan 18 '23 at 17:48