I am currently building an XGBoost model to classify insurance policies into 6 risk categories. I have 5 years' worth of policyholder data, including policy year. When building a GLM for a similar problem, I controlled for policy year and removed the variable's effect. My current ideas for controlling this variable in the XGBoost model are:
A.) In the final model, fix that variable at a base level, so that when making predictions the model treats every record as if it were from 2021 (or whatever the base level is).
B.) Stratify the data into subgroups by policy year (per this post), fit within each subgroup, and pool the results to get a single answer.
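A minimal sketch of option A, plus a variant that averages predictions over all policy years instead of picking a single base year. It assumes the fitted model is available as a function that maps a list of records to class probabilities. The names `predict_at_base_year`, `predict_averaged_over_years`, and `toy_predict_proba` are hypothetical, the toy scorer stands in for a real fitted classifier, and it uses two classes rather than six to keep the example short. With a real XGBoost model the records would be a DataFrame or array, and the same idea applies by overwriting the policy year column before scoring.

```python
def predict_at_base_year(predict_proba, rows, year_key="policy_year", base_year=2021):
    """Option A: overwrite policy year with a base level, then score."""
    # Build modified copies so the original records are left untouched.
    fixed = [{**row, year_key: base_year} for row in rows]
    return predict_proba(fixed)


def predict_averaged_over_years(predict_proba, rows, years, year_key="policy_year"):
    """Variant: score each record under every year and average the probabilities."""
    per_year = [predict_at_base_year(predict_proba, rows, year_key, y) for y in years]
    n_years = len(per_year)
    return [
        [sum(p[i][k] for p in per_year) / n_years for k in range(len(per_year[0][i]))]
        for i in range(len(rows))
    ]


def toy_predict_proba(rows):
    """Hypothetical stand-in for a fitted classifier (two classes, not six)."""
    out = []
    for row in rows:
        p_high = min(0.9, 0.1 + 0.05 * (row["policy_year"] - 2017) + 0.01 * row["claims"])
        out.append([1.0 - p_high, p_high])
    return out


rows = [
    {"policy_year": 2017, "claims": 0},
    {"policy_year": 2019, "claims": 3},
]

# Every record is scored as if it were written in 2021.
at_base = predict_at_base_year(toy_predict_proba, rows, base_year=2021)

# Every record is scored under each of the five years, then averaged.
averaged = predict_averaged_over_years(toy_predict_proba, rows, years=range(2017, 2022))
```

Note that in both cases the model is still trained with policy year as a feature, so any interactions it learned between policy year and other variables remain in the fitted trees. The sketch only fixes the level used at prediction time.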
Is there a formal methodology for controlling for a variable in a tree-based model? Is this a big concern when working with tree-based models?