
I apologise in advance for the long post; my questions are deeply interconnected, so it felt wrong to post them as separate threads. Please note also that I have edited this post to account for Florian Hartig's comments and questions.

Context and goal

I would like to build a model to predict the physical condition of a bird species in various areas of a city. EDIT: My goal is spatial interpolation: I want to estimate the species' physical condition for all woody patches of the city based on the values observed across the woody sites I sampled. I should also mention that these predictions will only be used to parametrise another model (as weights for the nodes of a graph-based connectivity model).
My dependent variable (DV) is continuous (approximately normal). I have a fairly small sample size ($N$ ~ 420), but the good news is that I have few predictors ($p$ ~ 15), all biologically relevant, which describe the environmental conditions surrounding the nestboxes where each bird was measured.
Now, the trouble is my data has a complex structure and my observations are not entirely independent:

  • An observation is a bird's measurement in a given nestbox for a given year (i.e. $N$).
  • Observations were made over 4 consecutive years (2019-2022), and some were made on the same nestboxes. Therefore, while most of my observations come from independent nestboxes (i.e. with a single measurement; $n_{single}$ ~ 240), the remaining nestboxes are each associated with 2 to 4 different bird measurements ($n_{multi}$ ~ 180).
  • Some of my nestboxes are spatially clustered, but I'm not sure this is a problem, since my predictors should account for much of the spatial autocorrelation and many observations are not clustered.

EDIT to answer Florian Hartig's first question (a): I think that I want to predict outside of my spatial structure, since I want to interpolate values (i.e. predict between my sampled sites/nestboxes). For the temporal structure, however, I don't really know, since my data are static and have no temporal resolution (for a given nestbox, all predictor values remain the same every year), so I think I want to predict for an "average" year.


My edited questions

I'm quite new to predictive modelling so I'm trying to figure out what would be the best approach for me here in terms of model type and validation procedure, hence my questions:

  1. Given my sample size and data structure, what method should I use to evaluate the predictive performance of my future model(s): a held-out test set or resampling (cross-validation or bootstrap)? EDIT (answer): I asked a more general question here, and I have read Frank Harrell say somewhere that hold-out testing is unreliable for $N$ < 20 000, so I guess I should use either repeated CV or the bootstrap to evaluate my modelling approach.
  2. Whatever the method, how can I account for the particular structure in my data? Should I use a kind of stratified resampling? If so, any advice on how to do that using R? EDIT (answer): It seems I should use blocked-CV (or blocked-bootstrap?) to account for my data structure. The blockCV package could here be useful.
  3. Alternatively, would using a mixed modelling framework allow me to avoid this daunting stratification process? I could also drop the repeated measures, but I'd prefer not to lose observations. EDIT (answer): theoretically yes, but that may not work well in practice.
  4. If all my observations were independent, I would not be in a high-dimensionality situation, but given my data structure, I don't know how to estimate my effective sample size. Should I go for variable selection (e.g. LASSO, AIC-based model selection and averaging) to obtain a better $n/p$ ratio?
  5. Actually, considering my situation, can I afford any model optimisation (e.g. variable selection or hyperparameter tuning) and the associated double bootstrap/nested CV? This answer (or this one) makes me think that I can't, yet I'm unsure, as I don't know the required sample size for these procedures: there seem to be no theoretical sample-size limitations on CV folds (fold size only affects the precision of the CV estimates). EDIT: Still, if I do model optimisation, I should use blocking to account for my structure (giving something like nested blocked CV?). Yet that seems awfully complicated with regard to my goal.
  6. All in all, what model type and validation procedure would you choose in my place? And with a focus on which metric (MSE, AIC, etc.)? Initially, I thought about a mixed-effects random forest regression to maximise prediction accuracy compared to a "regular" LMM, but simpler solutions are perhaps wiser?
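To make question 2 concrete: the key idea of blocked resampling is that all observations sharing a grouping unit (here, a nestbox) must fall in the same fold, so repeated measures never straddle the train/test split. Below is a minimal sketch using scikit-learn in Python with simulated data (in R, the blockCV or groupdata2 packages offer similar grouped splitting); the nestbox IDs, effect sizes, and model are all made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(42)
n, p = 420, 15                               # roughly the poster's N and p
X = rng.normal(size=(n, p))                  # simulated environmental predictors
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n)
# simulated nestbox ID per observation: repeated measures share an ID
nestbox = rng.integers(0, 300, size=n)

model = RandomForestRegressor(n_estimators=200, random_state=0)
# GroupKFold keeps every observation from a given nestbox in one fold,
# so the repeated measures cannot leak between training and test sets
cv = GroupKFold(n_splits=5)
scores = cross_val_score(model, X, y, groups=nestbox, cv=cv,
                         scoring="neg_mean_squared_error")
cv_mse = -scores.mean()
print(cv_mse)
```

The same `groups=` argument works with any scikit-learn estimator, so swapping in a linear model or gradient boosting requires no change to the validation logic.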

As you can see, there are crucial gaps in my understanding, so I'm having trouble connecting the dots and deciding on the safest path to take. I really want to learn, so I purchased "Regression Modeling Strategies" by Frank Harrell as well as "Statistical Learning with Sparsity" by Hastie et al., but it will take me months to read and digest them both. Consequently, any helpful advice and resources (especially for R implementations) will be highly appreciated.

Fanfoué

1 Answer


OK, I will not be able to answer all your questions, but I will provide some thoughts.

a) First of all, for structured data, you have to specify whether you want to predict inside your structure or outside it. For example, if you have a random effect for year, do you want to predict for a year in your dataset, or for a new year? The same goes for spatial structure and so on.

b) For all resampling approaches such as the bootstrap or CV, permutations and blocking should be done according to the answer to a); see also Roberts, David R., et al. "Cross‐validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure." Ecography 40.8 (2017): 913-929. So, if you want to see how your model performs when predicting to a year for which you have data, use a random CV, but if you want to see how it would perform when predicting to a new year, use a blocked CV.
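The random-vs-blocked contrast in b) can be demonstrated in a few lines. The sketch below (Python/scikit-learn, simulated data with an invented year-level effect; a blocked split in R could be built with blockCV or a manual leave-one-year-out loop) compares a random 5-fold CV with a leave-one-year-out CV: the random CV estimates error for the sampled years, while the blocked CV estimates error for a new, unseen year, which is typically larger when a year effect exists.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, KFold, cross_val_score

rng = np.random.default_rng(1)
n = 400
year = rng.integers(2019, 2023, size=n)        # 4 sampling years, as in the question
x = rng.normal(size=(n, 1))
year_effect = (year - 2019) * 1.0              # invented year-level signal
y = 2.0 * x[:, 0] + year_effect + rng.normal(scale=0.5, size=n)

model = LinearRegression()
# random CV: folds mix all years, so it measures within-structure error
random_mse = -cross_val_score(
    model, x, y, cv=KFold(5, shuffle=True, random_state=0),
    scoring="neg_mean_squared_error").mean()
# blocked CV: each fold holds out one entire year (leave-one-year-out)
blocked_mse = -cross_val_score(
    model, x, y, groups=year, cv=GroupKFold(n_splits=4),
    scoring="neg_mean_squared_error").mean()
print(random_mse, blocked_mse)
```

With this simulated year effect, the blocked estimate comes out clearly larger, which is exactly the optimism that random CV hides when the goal is prediction to a new year.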

c) If you could correctly "guess" the structure of your data and use appropriate models (such as mixed models, a spatial CAR structure, etc.), you could in principle rely on the parametric inference of these models even outside your structure, without nonparametric methods such as a blocked CV. However, some of our simulations in the Roberts paper suggest that this does not work reliably on real data, presumably because it is rare that you will guess the correct model structure.

d) I don't see a use for calculating effective sample size. It's poorly defined in such a context anyway.

e) Regarding model optimisation and validation: see answer a)

Response to edited question:

Hi Fanfoué, thanks for your clarifications. If you want to INTERPOLATE in space, you want to predict INSIDE your spatial structure, so there is no need for blocking in space; and if you want to predict for an average year, you can probably ignore time as well. If you are really only interested in predictions, just use a random forest or BRT with all environmental features + x,y as predictors, and a random CV to validate.
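That recommendation fits in a short script. Here is a minimal sketch in Python/scikit-learn with simulated data (the coordinates, predictors, and spatial trend are all invented; the equivalent in R would be randomForest or ranger with the coordinates bound into the predictor matrix): a random forest on environmental features plus the x,y coordinates, validated with a plain random CV, then refit on all data to predict the unsampled woody patches.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(7)
n = 420
coords = rng.uniform(0, 10, size=(n, 2))       # nestbox x, y coordinates
env = rng.normal(size=(n, 13))                 # simulated environmental predictors
# simulated condition: environmental effects plus a smooth spatial trend
y = (env[:, 0] - 0.5 * env[:, 1]
     + np.sin(coords[:, 0]) + rng.normal(scale=0.3, size=n))

# environmental features + x,y as predictors, as suggested in the answer
X = np.hstack([env, coords])
rf = RandomForestRegressor(n_estimators=500, random_state=0)
# random (unblocked) CV, appropriate because the goal is interpolation
# INSIDE the sampled spatial structure
mse = -cross_val_score(rf, X, y, cv=KFold(5, shuffle=True, random_state=0),
                       scoring="neg_mean_squared_error").mean()
rf.fit(X, y)                                   # final model for the unsampled patches
```

New woody patches would then be scored by stacking their (hypothetical) environmental values and coordinates the same way and calling `rf.predict`.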