1

I am using R package mgcv to model spatio-temporal variation in an ecological variable (y). I have data for "y" from across 8 years, and from >2000 locations across the United Kingdom. I added uncorrelated predictors to explain the spatio-temporal variation in "y" using the following model (where Tmax, Rain, Wind and Frost are environmental variables recorded for the x,y locations, and the rest are percentage land cover of the respective category at that location; ALAN is the intensity of light pollution):

gam7 <- bam(y ~ s(Year, bs = "tp") +
              s(x, y) +
              ti(x, y, Year,  d = c(2,1)) +
              s(Tmax) + s(Rain) + s(Wind) + s(Frost) + s(Arable) + 
              s(Mountain..heath.and.bog) + s(Saltwater) + s(Freshwater) + 
              s(Coastal) + s(Built.up.gardens) + s(ALAN) + s(Elevation) + 
              s(Woodland) + s(Grassland),
            data = df, family = "gaussian", discrete = TRUE, select = TRUE,  
            nthreads = 6)

In the image below you can see the x,y locations that are used to build the above model shown as black points on an elevation map enter image description here

You will notice that some locations are off-the coast, and the "Coastal" covariate is assigned 100 in those categories. The model diagnostics : enter image description here

and this is the model output:


Parametric coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.8946 0.2164 31.85 <2e-16 ***


Signif. codes: 0 ‘*’ 0.001 ‘’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms: edf Ref.df F p-value
s(Year) 4.915 6 28.670 <2e-16 *** s(x,y) 629.085 699 67.410 <2e-16 *** ti(x,y,Year) 36.837 49 23.424 <2e-16 *** s(Tmax) 20.328 24 44.886 <2e-16 *** s(Rain) 8.855 49 5.112 <2e-16 *** s(Wind) 5.843 99 0.712 <2e-16 *** s(Frost) 14.902 34 4.684 <2e-16 *** s(Arable) 29.199 99 2.862 <2e-16 *** s(Mountain..heath.and.bog) 11.376 99 1.572 <2e-16 *** s(Saltwater) 6.667 34 1.311 <2e-16 *** s(Freshwater) 7.047 34 1.412 <2e-16 *** s(Coastal) 4.800 24 4.657 <2e-16 *** s(Built.up.gardens) 10.795 99 0.455 <2e-16 *** s(ALAN) 4.945 24 1.370 <2e-16 *** s(Elevation) 87.439 99 13.808 <2e-16 *** s(Woodland) 27.048 99 1.634 <2e-16 *** s(Grassland) 41.934 99 2.566 <2e-16 ***


Signif. codes: 0 ‘*’ 0.001 ‘’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) = 0.469 Deviance explained = 47.3% fREML = 1.5914e+05 Scale est. = 0.6031 n = 134598

I then stacked the rasters for all the covariates included in the model (for all years of the original df). I converted the raster stack into a dataframe using raster::as.data.frame(stacked_df, xy = TRUE). And finally, I used the mgcv::predict function to predict "y" for these data:

predictions <- predict(gam7, new_df, type = "response", se.fit = TRUE)

I then plotted the resulting spatio-temporal maps of "y" and they look like this:

This is what I get: enter image description here

The maps look suspicious, simply because all locations which were used to build the original model (gam7), look peculiarly dark (low "y"), and all the surrounding regions are lighter (higher "y").

I am trying to understand what could cause something like this? What are the things I can check in my model/data to make sure this is not spurious in any way? (And I think that it is spurious and not ecologically meaningful, because when I removed a particular cluster of locations from the south from gam7, and then re-did the predictions - the darker regions from around those clusters vanished, and the region turned yellow on the viridis scale)

I think adding the s(x,y) takes care of the spatial autocorrelation as explained here

I think this is more of a conceptual issue in my understanding, rather than a 'coding' issue so I have not included the data and/or reproducible example. But if people feel that it might help in anyway, please let me know - I will add examples for a smaller subset.

Mansi
  • 41

0 Answers0