Generalizable model with GPBoost

Question

I'm back with a follow-up to my earlier question: Making binary prediction with GPBoost (or MERF)

The goal of this project is to predict injuries, for which I only have a small subset of athletes available. This is a longitudinal study: each athlete has multiple observations. The model can be trained using the GPBoost or MERF algorithm. The training set contains x athletes and the test set contains y athletes, and the athlete's ID is the random effect. Ideally, my test set would not have an ID column (the random effect), because the goal is to build one model for the whole 'population'. Is it still possible to make predictions, or does the test set NEED an ID column? If an ID is required, can I assign an arbitrary number (like 999) as the test set ID? In the source at this GitHub link, github.com/fabsig/GPBoost/blob/master/python-package/gpboost/…, the default for group_data_pred is None.

If something is not clear, let me know, then I can edit the question.

Thanks in advance! Kind regards, Olivier

Random intercepts are not likely to induce the correct within-person correlation structure. — Frank Harrell, Jul 30 '23 at 13:12

score 0 · Answer 1 · answered Jun 13 '22 at 13:50


Yes, this is possible in GPBoost. You can simply pass a value that does not occur in the training group_data / ID (e.g., -1) as the group_data_pred argument of the predict function. The predicted random effects will then be zero. Also, for the GPBoost algorithm, the predictions of the fixed and random effects are returned separately when setting pred_latent=True in the predict function. See this Python example. You can then verify that all your predicted random effects (random_effect_mean) are indeed zero.

fabsig


Isn't that the wrong way to predict? If you want to predict for new individuals, should this not be done integrating over the random effects distribution? Obviously, with a linear scale this would not make a difference for the point prediction (but would for prediction intervals), but if you work, say, on the logit scale for binary outcomes, this would make a difference. Right? – Björn Jun 13 '22 at 14:26

With non-Gaussian data, this is what is being done in GPBoost when you predict the response variable (pred_latent=False). See the companion article https://ieeexplore.ieee.org/document/9759834 – fabsig Jun 14 '22 at 16:51
Thanks for clarifying. – Björn Jun 14 '22 at 17:14
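
The point raised in these comments can be illustrated with a small standalone calculation. This is only a sketch, not GPBoost's internal method: the fixed effect of 2.0, the random-intercept standard deviation of 1.5, and the helper marginal_probability are all hypothetical. It compares plugging in a random effect of zero on the logit scale with averaging the predicted probability over the random-effect distribution.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def marginal_probability(fixed_effect, sigma, n_points=2001, width=8.0):
    """Approximate E[sigmoid(fixed_effect + b)] for b ~ N(0, sigma^2)
    with a trapezoid rule on [-width * sigma, width * sigma]."""
    lo, hi = -width * sigma, width * sigma
    step = (hi - lo) / (n_points - 1)
    total = 0.0
    for i in range(n_points):
        b = lo + i * step
        density = math.exp(-0.5 * (b / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
        weight = 0.5 if i in (0, n_points - 1) else 1.0  # trapezoid end points
        total += weight * sigmoid(fixed_effect + b) * density * step
    return total

fixed_effect = 2.0  # hypothetical latent (logit-scale) prediction for a new athlete
sigma = 1.5         # hypothetical standard deviation of the random intercepts

plug_in = sigmoid(fixed_effect)                       # random effect set to zero
marginal = marginal_probability(fixed_effect, sigma)  # random effect integrated out

print(f"plug-in:  {plug_in:.3f}")
print(f"marginal: {marginal:.3f}")
```

The marginal probability is pulled toward 0.5 relative to the plug-in value, so the two approaches coincide only when the fixed effect is zero on the logit scale or the random-effect variance is zero.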
