I have longitudinal data for several users, so this is a repeated-measures setting. The plan is to apply some classification or regression model. While there are models suited to this, like MLMs and GEEs, I'm interested in using predictive ML methods like the SVM; however, the within-subject correlations must be accounted for. It seems there isn't a way to explicitly tell the SVM that a set of measurements belongs to a single user, other than not letting the same user appear in more than one of the training, validation, and test sets. Is this correct?
1 Answer
There are quite a few other options besides the ones you mention. I tried to roughly classify the options I'm aware of (I'm sure there are other things one can do):
- Feature representations for IDs:
- Embeddings
- encodings based on an individual's features (e.g. target encoding based on their personal history; see the first sketch after this list)
- A lot of these come down to training some explicit (or implicit) model and then using its feature representation as input to another model. The first model could be a neural network, an autoencoder, UMAP, or just a GLMM/MMRM. You can even do this in a proper statistical-inference setting if you bootstrap the whole process, but usually this is done for prediction purposes.
- Model internal representations for individuals:
- embedding layer for individuals (in many Kaggle competitions this has been key to winning or placing near the top, e.g. the famous Rossmann Store Sales solution that popularized embedding layers for high-cardinality categorical features; embeddings are also enormously popular for recommender systems). A PyTorch sketch follows this list.
- of course, random effects are a bit like 1-D embeddings
- There are various proposals to modify existing algorithms like Random Forest to use random effects (see e.g. this old question, this one, and this one).
- What works for random forest (where trees get averaged) may not work the same way with XGBoost/LightGBM/etc., because there the trees are added together in a weighted sum rather than averaged.
- For the SVM, there seems to be some work on explicitly incorporating such structure, e.g. via Fisher kernels.
- Reflecting it in what the model predicts:
- Predict either a time series of data for an individual, or all the data for the individual at once (mostly something neural networks can be made to do; tree-based models tend not to be so great with multivariate output). If data are partially missing, don't incur a loss for those data points no matter what the model predicts (see the masked-loss sketch after this list).
- Reflecting individuals in how the models are trained:
- As you already mentioned, split for validation/testing in the way that matches how you will really make predictions in practice (i.e. if you want to predict for new individuals, then you should not have data from the same individual in more than one of the training, validation, or test sets). A scikit-learn example of such grouped splitting is given at the end of this answer.
- E.g. for random forest/boosted trees, where you bootstrap or subsample the data, you could only ever subsample whole individuals (not implemented in the major libraries; a hand-rolled sketch follows this list).
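To make a few of the bullets above concrete, here are some minimal Python sketches. Everything below (data, column names, dimensions, helper names) is a made-up illustration, not the API of any particular library. First, target encoding based on personal history, computed in pandas with a one-step shift so a measurement's own target never leaks into its feature:

```python
import pandas as pd

# Toy longitudinal data: repeated measurements per user (hypothetical columns).
df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3, 3, 3],
    "t":       [0, 1, 2, 0, 1, 0, 1, 2],
    "y":       [0.2, 0.4, 0.1, 1.3, 1.1, 0.5, 0.7, 0.6],
}).sort_values(["user_id", "t"])

# Running mean of each user's *past* targets; shift(1) drops the current row.
df["user_hist_mean"] = (
    df.groupby("user_id")["y"]
      .transform(lambda s: s.shift(1).expanding().mean())
)

# The first observation of a user has no history; fall back to the global mean.
df["user_hist_mean"] = df["user_hist_mean"].fillna(df["y"].mean())
```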
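Next, a minimal sketch of an embedding layer for individual IDs in PyTorch; the architecture and all dimensions are arbitrary choices for illustration:

```python
import torch
import torch.nn as nn

class UserEmbeddingNet(nn.Module):
    """Tiny regression net that learns a dense vector per user ID."""
    def __init__(self, n_users: int, n_features: int, emb_dim: int = 8):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, emb_dim)  # one learned row per user
        self.head = nn.Sequential(
            nn.Linear(emb_dim + n_features, 32),
            nn.ReLU(),
            nn.Linear(32, 1),
        )

    def forward(self, user_ids: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        e = self.user_emb(user_ids)                  # (batch, emb_dim)
        return self.head(torch.cat([e, x], dim=1)).squeeze(-1)

model = UserEmbeddingNet(n_users=100, n_features=5)
pred = model(torch.randint(0, 100, (16,)), torch.randn(16, 5))  # shape (16,)
```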
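Then the masked-loss idea for partially missing per-individual outputs; `masked_mse` is a hypothetical helper, not a library function:

```python
import torch

def masked_mse(pred: torch.Tensor, target: torch.Tensor,
               mask: torch.Tensor) -> torch.Tensor:
    """MSE over observed entries only; mask is 1.0 where the target exists.

    Unobserved targets should hold any finite placeholder (e.g. 0.0),
    since NaN * 0 is still NaN in PyTorch.
    """
    se = (pred - target) ** 2 * mask            # missing entries contribute 0
    return se.sum() / mask.sum().clamp(min=1)   # average over observed entries

# Example: 4 individuals, 3 time points each, some targets unobserved.
pred   = torch.randn(4, 3)
target = torch.randn(4, 3)
mask   = torch.tensor([[1., 1., 0.],
                       [1., 0., 0.],
                       [1., 1., 1.],
                       [0., 1., 1.]])
loss = masked_mse(pred, target, mask)
```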
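And a sketch of bagging over whole individuals. Since the major libraries don't expose group-wise bootstrapping, this hand-rolls a small forest with scikit-learn trees; `fit_user_bagged_trees` is again a made-up helper:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def fit_user_bagged_trees(X, y, user_ids, n_trees=100):
    """Bagged trees where each bootstrap resamples whole users, not rows."""
    users = np.unique(user_ids)
    trees = []
    for _ in range(n_trees):
        boot_users = rng.choice(users, size=len(users), replace=True)
        # All rows of every sampled user (users drawn twice appear twice).
        idx = np.concatenate([np.flatnonzero(user_ids == u) for u in boot_users])
        # max_features="sqrt" mimics the feature subsampling of a random forest.
        trees.append(DecisionTreeRegressor(max_features="sqrt").fit(X[idx], y[idx]))
    return trees

def predict_bagged(trees, X):
    return np.mean([t.predict(X) for t in trees], axis=0)
```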
The question is of course how much difference it will make. As far as I am aware, the answer is that it depends on what you are trying to do and on the specifics of the situation. Explicitly reflecting correlation in the model, or in how the model outputs things, will definitely matter a lot for inference (e.g. getting confidence intervals with good operating characteristics), but may not always matter for making good point predictions. However, we also know that good representations for high-cardinality features (like embeddings of individual IDs) can help a lot with prediction, and one can probably find circumstances where it matters a lot even with low cardinality. One good thing is that with an appropriate training-/validation-/test-splitting setup, one can evaluate how different approaches fare.
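As a concrete example of such a splitting setup, scikit-learn's `GroupKFold` guarantees that no individual's measurements end up in both the training and the test fold (the data here are random stand-ins):

```python
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.svm import SVR

X = np.random.randn(200, 5)
y = np.random.randn(200)
user_ids = np.repeat(np.arange(40), 5)   # 40 users, 5 measurements each

# Each user falls into exactly one fold, so no user spans train and test.
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=user_ids):
    model = SVR().fit(X[train_idx], y[train_idx])
    score = model.score(X[test_idx], y[test_idx])
```

`GroupShuffleSplit` works the same way if you only need a single train/test split.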
Thank you for your answer. The links were very helpful! I think I may go with GPBoost https://github.com/fabsig/GPBoost – irene Mar 12 '23 at 01:58