Let me start by saying that I am rather new to GAMs and using the mgcv package, so let me know if this question has already been addressed thoroughly in this forum. I am aware that there already exist a number of posts addressing the same issue, e.g.
Why does including latitude and longitude in a GAM account for spatial autocorrelation?
My question relates to how to correctly account for spatial autocorrelation when fitting GAMs. I am currently building a GAM to describe house prices. The dataset is a collection of roughly 200K house sales in a particular geographic region. Relevant variables are:
i) The sales price $P_{i}$ for each data point.
ii) The coordinates of each house as (x,y)-coordinates.
iii) A range of other variables $var_{1}$, $var_{2}$, … describing the house, such as its age, condition, size, etc.
My strategy, which seems to be widely used in the literature on this subject, is to predict the prices using a hedonic regression model of the form:
$$ P(x, y, var_{1}, var_{2}, \dots) = s(x,y) + s(var_{1}) + s(var_{2}) + \dots $$
My question concerns the geographic spline s(x,y), which I include to model the spatial autocorrelation in my dataset. The issue is that in order to get a fit that passes the basis-dimension check in gam.check(), I find that a rather large number of basis functions is required (around k=5000), which is computationally infeasible. What would be the best strategy to tackle this issue? In his thorough introduction to GAMs, https://www.youtube.com/watch?v=sgw4cu8hrZM, Gavin Simpson mentions that in cases such as this it might not be worth pursuing a GAM fit that fully passes these checks, and that one might instead opt for alternative methods to address the spatial autocorrelation in the dataset. Could anyone elaborate on this point?
Thanks a lot
[…] bam() with this many data) but as I've mentioned here a lot, gam.check() isn't an infallible test and spatial or temporally structured data can throw it off bc of the way it works. That said, it typically does tell you that you might have unmodelled correlation, so you may need models that aren't spatially smooth. – Gavin Simpson Feb 16 '24 at 14:33
[…] bam() that only happens with discrete = TRUE and even then it typically only has a very small effect. I should say that using a TPR smooth (default) is going to be slow with large data. So consider using bs = "cr" on the s() terms and te(x,y) for the spatial smooth. – Gavin Simpson Feb 17 '24 at 19:02
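The suggestions in the comments (bam() for large data, bs = "cr" on the s() terms, and te(x,y) for the spatial smooth) might look roughly like the sketch below. This is only an illustration under stated assumptions: the data are simulated, the column names price, age and size are hypothetical stand-ins for the real variables, and the basis size k = c(10, 10) is an arbitrary starting point rather than a recommendation.

```r
library(mgcv)

set.seed(1)
n <- 2000  # small simulated stand-in for the ~200K real sales

# Hypothetical columns: coordinates (x, y) plus two house characteristics
dat <- data.frame(
  x    = runif(n),
  y    = runif(n),
  age  = runif(n, 0, 100),
  size = runif(n, 50, 250)
)

# Simulated price: a smooth spatial surface plus covariate effects and noise
dat$price <- with(
  dat,
  sin(2 * pi * x) * cos(2 * pi * y) - 0.01 * age + 0.02 * size +
    rnorm(n, sd = 0.3)
)

# bam() is mgcv's large-data fitting routine; discrete = TRUE requires fREML.
# te(x, y) is a tensor-product spatial smooth; bs = "cr" uses cubic
# regression splines instead of the default thin-plate basis.
fit <- bam(
  price ~ te(x, y, k = c(10, 10)) + s(age, bs = "cr") + s(size, bs = "cr"),
  data     = dat,
  method   = "fREML",
  discrete = TRUE
)

summary(fit)

# Basis-dimension diagnostics; as noted in the comments, spatially
# structured data can throw this check off, so treat it as a hint only.
gam.check(fit)
```

If the check still flags the spatial term, increasing k in te() step by step and watching how the effective degrees of freedom respond is one way to explore whether the basis is genuinely too small or whether the check is reacting to unmodelled short-range correlation.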