It was suggested that my questions were too broad. As I commented below, I have nearly a million data points and perhaps a hundred variables. This may be a very basic modeling question: I am curious how to start a GAM with a large dataset. I have tried the 'bam' function with a much smaller dataset, and it didn't work as I expected. I do have access to supercomputers, but it still seems impractical to tune a GAM on a dataset this big. It was suggested that I pick 8 to 10 variables and fit a GAM. Still, it is slow to run a GAM on the complete dataset, so my guess is that I need to reduce both the number of variables and the sample size to fit a GAM.
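For context, this is roughly what I mean by reducing the sample size first; the data frame clim, the response abundance, and the bio* predictors are placeholder names, not my actual variables:

```r
## Placeholder names: `clim` (full data frame), `abundance` (response),
## `bio1`..`bio3` (bioclimatic predictors).
library(mgcv)

set.seed(1)
idx <- sample(nrow(clim), 50000)   # prototype on ~50k of the ~1M rows
clim_sub <- clim[idx, ]

## small prototype model with a handful of candidate smooths
m0 <- gam(abundance ~ s(bio1) + s(bio2) + s(bio3),
          data = clim_sub, method = "REML")
summary(m0)
```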
My original questions: I have 61 bioclimatic variables that explain different or similar aspects of insect life cycles, and some of them are highly correlated. My study extent covers the North American continent at a spatial resolution of 10 km. The temporal resolution is yearly and the temporal range is 20 years, which means my dataset is huge for GAMs. I have built models using GLMs instead for prediction purposes. However, those models are complicated (e.g., 777265*263) and not easy to interpret. So I am trying to use GAMs to build small models that include fewer variables and some percentage of the samples, for interpretation purposes. I followed some questions about the 'mgcv' package and found that most of the examples use a very small number of variables. Does that mean I need to handpick the variables?

I followed the model-selection guidance in mgcv's 'gam.selection' help page with a smaller dataset (828*54), and I can see that some variables are not significant as smoothed terms. I also used the 'concurvity' function to examine potential multicollinearity. Now I need some suggestions on variable selection: What is an appropriate number of variables for an explanatory GAM? Do I select the variables based on my knowledge, the 'gam.selection' results where significant nonlinearity is detected, and the 'concurvity' results? Or what would be the most efficient way to go about this variable selection process? I appreciate your thoughts and timely help.
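For reference, this is roughly the workflow I used on the smaller dataset (variable names are again placeholders): fit with select = TRUE so terms can be shrunk, then inspect summary() and concurvity():

```r
## Placeholder names; `small_dat` stands for the 828 x 54 dataset.
library(mgcv)

m1 <- gam(abundance ~ s(bio1) + s(bio2) + s(bio3) + s(bio4),
          data = small_dat, method = "REML",
          select = TRUE)           # extra penalty lets terms shrink towards zero

summary(m1)                        # terms with edf near 0 have been shrunk out
concurvity(m1, full = TRUE)        # values near 1 indicate problematic concurvity
concurvity(m1, full = FALSE)       # pairwise breakdown between smooth terms
```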
Comments:

… select = TRUE and using the bam() function in mgcv. – Gavin Simpson May 07 '19 at 19:26

… s(..., bs = 'cr') within gam (not bam)? – usεr11852 May 07 '19 at 22:16

bam() with s(..., bs = 'cr') is probably as efficient as it is going to get with the sorts of models mgcv can fit. – Gavin Simpson May 07 '19 at 22:53

… gam(). You want to fit a model that shrinks all terms, and select = TRUE will do that. Don't remove variables that are not significant; that is a hard statement that the effect is exactly 0. Keep them in, and the summary() outputs will reflect the uncertainty over whether something is even in the model or not. – Gavin Simpson May 07 '19 at 22:56

… (bam() with s(..., bs = 'cr')) and see how it works. Forgive me, I am not very familiar with GAMs. Do you mean that I can add the covariates to the model without examining their collinearity beforehand? – Dongmei Chen May 08 '19 at 01:23

method = 'REML', select = TRUE or method = 'ML', select = TRUE will give you the best chance of not succumbing to concurvity issues, and you can always check afterwards with the concurvity() function that mgcv provides. – Gavin Simpson May 08 '19 at 15:17

… bam() on fits on the order of millions to tens of millions of observations, I would venture that your problem is not so huge as to be impractical. What I would also venture is that you are probably going about this problem the wrong way if you are throwing in all manner of 61 bioclimatic variables. Surely you can choose from among them to limit the model space first - i.e. you don't need to get it down to 8-10, but you might not want to include min, mean, and max temperature for all months as variables in the model... – Gavin Simpson May 08 '19 at 15:22

… ?bam and understood how to get the best from it; you'll want method = 'fREML', select = TRUE, discrete = TRUE, and you'll want to be able to use multiple cores (setting nthreads and making sure you have multithreading capabilities in your BLAS), and make sure you have a lot of RAM available. See the papers cited in ?bam for indications of the problem size that this function is designed for. Then be prepared for this to take a while to fit. – Gavin Simpson May 08 '19 at 15:23
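Putting the settings recommended in the comments together, a sketch of the suggested bam() call might look like the following; the data frame clim, the response abundance, and the bio* predictors are placeholder names, and nthreads should match the cores actually available:

```r
## Placeholder names: `clim`, `abundance`, `bio1`..`bio3`; adjust nthreads
## to the cores available and make sure enough RAM is free.
library(mgcv)

mfull <- bam(abundance ~ s(bio1, bs = "cr") + s(bio2, bs = "cr") +
                         s(bio3, bs = "cr"),
             data     = clim,
             method   = "fREML",   # fast REML, the default for bam()
             select   = TRUE,      # double penalty so terms can shrink to zero
             discrete = TRUE,      # discretise covariates for speed
             nthreads = 4)         # parallel threads (benefits from a multithreaded BLAS)

summary(mfull)
concurvity(mfull, full = TRUE)     # check concurvity after fitting
```

With select = TRUE, non-significant smooths are shrunk rather than dropped, which matches the advice above about keeping variables in the model and letting the summary() output reflect the uncertainty.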