
Let me start by saying that I am rather new to GAMs and to the mgcv package, so let me know if this question has already been addressed thoroughly in this forum. I am aware that a number of posts already address related issues, e.g.:

Why does including latitude and longitude in a GAM account for spatial autocorrelation?

My question relates to how to correctly account for spatial autocorrelation when fitting GAMs. I am currently building a GAM to describe house prices. The dataset is a collection of roughly 200K house sales in a particular geographic region. The relevant variables are:

i) The sales price $P_{i}$ for each data point.

ii) The location of each house as $(x, y)$ coordinates.

iii) A range of other variables $var_{1}, var_{2}, \ldots$ describing the house, such as its age, condition, size, etc.

My strategy, which seems to be widely used in the literature on this subject, is to predict the prices using a hedonic regression model of the form:

$$ P(x, y, var_{1}, var_{2}, \ldots) = s(x, y) + s(var_{1}) + s(var_{2}) + \ldots $$
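
In mgcv syntax, the model I have in mind looks roughly like the sketch below (the data frame and column names are placeholders for my actual variables, not real code):

```r
library(mgcv)

## Hedonic GAM: a 2-D smooth of the coordinates plus univariate
## smooths of the house characteristics. `sales`, `price`, `x`, `y`,
## `age`, `size` are placeholder names.
m <- gam(price ~ s(x, y, k = 5000) +  # k ~ 5000 is what the check pushes me towards
                 s(age) + s(size),
         data   = sales,
         method = "REML")
```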

My question concerns the spatial smooth $s(x, y)$, which I include to model the spatial autocorrelation in my dataset. The issue is that in order to achieve a converged GAM (as measured by the gam.check() routine), I find that a rather large number of basis functions is required (around k = 5000), which is computationally infeasible. What would be the best strategy for tackling this issue? In his thorough introduction to GAMs (https://www.youtube.com/watch?v=sgw4cu8hrZM), Gavin Simpson mentions that in cases such as this it might not be worth pursuing a perfectly converged GAM fit, and that alternative methods for addressing the spatial autocorrelation may be preferable. Could anyone elaborate on this point?
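
To be concrete, the check I am referring to is the basis-dimension diagnostic from gam.check() (a minimal illustration, using the placeholder model `m` from the sketch above):

```r
## Low k-index values and small p-values in the printed table suggest
## that k may be too small -- though, as discussed in the comments
## below, spatially structured residuals can also trigger this.
gam.check(m)

## k.check() prints just the basis-dimension table, without the
## residual plots.
k.check(m)
```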

Thanks a lot

  • What do you mean by "converged" here? I doubt you actually mean its correct meaning, as there must be something very odd if a simple model wouldn't converge. Using 5000 basis functions doesn't seem completely outlandish with that many observations (and you should be fitting with bam() with this many data), but as I've mentioned here a lot, gam.check() isn't an infallible test, and spatially or temporally structured data can throw it off because of the way it works. That said, it typically does tell you that you might have unmodelled correlation, so you may need models that aren't spatially smooth. – Gavin Simpson Feb 16 '24 at 14:33
  • Also, you've tagged this with [tag:gamm4]; is there a reason for this? What kind of model are you trying to fit? Please show your code for the model fit. – Gavin Simpson Feb 16 '24 at 14:37
  • Thank you for the answer. The gamm4 tag was an autofill mistake from simply trying to tag the question with gam. "Converged" was the wrong word to use; I meant simply that the model uses a sufficient number of basis functions to represent the spatial part. I have considered the bam() option, but I had the impression that it sacrifices some predictive accuracy. – August Edwards Feb 16 '24 at 15:30
  • A follow-up question: I have seen approaches where a GAM such as the one above is combined with a more local method such as k-NN, in which case one does not need a very large number of basis functions. The reasoning, as I understand it, is that the k-NN takes care of the spatial autocorrelation on short length scales (i.e. the very wiggly part), while the GAM describes the longer-range, less wiggly part. Does this make sense from your point of view? – August Edwards Feb 16 '24 at 15:41
  • Re bam(): that only happens with discrete = TRUE, and even then it typically has only a very small effect. I should say that using a TPRS smooth (the default) is going to be slow with large data, so consider using bs = "cr" on the s() terms and te(x, y) for the spatial smooth (a sketch of this appears after these comments). – Gavin Simpson Feb 17 '24 at 19:02
  • A problem with the GAM / k-NN approach, or similar ones, will be identifiability: if you don't limit the wiggliness of the smooth part and limit the length scale of the k-NN part, the model may not be identifiable, or you can get into weird situations where one of the two processes absorbs all the variation. Something to consider is the new NCV smoothness-selection method in mgcv, which can estimate the wiggliness by leaving out the neighbours of observations (also sketched below). It's very new and I've not tried it myself yet, but it might help with estimating the smooths. – Gavin Simpson Feb 17 '24 at 19:06
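
For readers following this thread, here is a minimal sketch of what the two suggestions in the comments above might look like in code. It is not a tested recipe: the data frame and column names are placeholders, the basis dimensions are illustrative, and the NCV part assumes a recent mgcv (>= 1.9-0); see ?bam and ?NCV for details.

```r
library(mgcv)

## Suggestion 1: bam() with fast "cr" bases and a tensor-product
## spatial smooth instead of the default thin-plate smooth.
m_bam <- bam(price ~ te(x, y, bs = "cr", k = c(50, 50)) +
                     s(age,  bs = "cr") +
                     s(size, bs = "cr"),
             data     = sales,
             method   = "fREML",
             discrete = TRUE)  # discretised covariates: much faster,
                               # typically only a very small accuracy cost

## Suggestion 2: neighbourhood cross-validation (NCV), which selects
## smoothness by leaving out each observation's spatial neighbours.
## The neighbourhood structure `nei` must be constructed by the user
## from the coordinates; see ?NCV for the exact components it needs.
m_ncv <- gam(price ~ s(x, y) + s(age) + s(size),
             data   = sales,
             method = "NCV",
             nei    = nei)
```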

0 Answers