
I have a dataset with 3 features of interest. Within the boosting (and specifically XGBoost) framework, if I want to account for all possible interactions between the features, do the interaction terms need to be included in the input matrix X? Or would I simply input the 3 features of interest, and XGBoost would consider the interactions on its own?

I have looked into this, but most of the results concern feature interaction constraints. The fact that you can impose such constraints makes me think I do not need to explicitly include interaction columns between features.

pdhami
    Questions solely about how software works are off-topic here, but you may have a real statistical question buried here. You may want to edit your question to clarify the underlying statistical issue. You may find that when you understand the statistical concepts involved, the software-specific elements are self-evident or at least easy to get from the documentation. – kjetil b halvorsen May 04 '22 at 14:14

1 Answer


In theory, tree-based models like gradient-boosted decision trees (XGBoost is one example of a GBDT model) can capture feature interactions without explicit interaction columns: within some of the trees the model consists of, there is first a split on one variable and then a split on the other. Since the leaf values of all trees are added together to make a prediction, different combinations of splits on the two variables across trees can, with enough trees, approximate most functions.
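As a minimal sketch of this (synthetic data and illustrative hyperparameters, not anything from your dataset), here is an XGBoost model that is given only the three raw columns and still learns a target driven by a pure x1*x2 interaction:

```python
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(5000, 3))               # three raw features, no interaction column
y = X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=5000)  # target driven by an x1*x2 interaction

# max_depth >= 2 lets a single tree split on x1 and then on x2,
# so the ensemble can approximate the interaction surface.
model = XGBRegressor(n_estimators=300, max_depth=3, learning_rate=0.1)
model.fit(X, y)

pred = model.predict(X)
print(float(np.mean((y - pred) ** 2)))  # training MSE, small relative to var(y) of about 0.12
```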

However, if you suspect or believe that particular transformations of features (e.g. interactions or some more complex functions of multiple features) are important, you make it much easier for the model if you provide these transformations directly as new features. The process of coming up with good new features is called "feature engineering" and can in some cases make a huge difference.
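For three features, one convenient way to provide all possible interactions explicitly is scikit-learn's PolynomialFeatures with interaction_only=True. A hedged sketch with made-up data:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 3))  # stand-in for the three raw features

# degree=3, interaction_only=True appends all pairwise products and the
# three-way product, without squares or cubes of individual columns.
poly = PolynomialFeatures(degree=3, interaction_only=True, include_bias=False)
X_aug = poly.fit_transform(X)
# columns: x1, x2, x3, x1*x2, x1*x3, x2*x3, x1*x2*x3
print(X_aug.shape)  # (1000, 7)
```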

When will it make the biggest difference?

  • When there is not much data, so that it is hard for the model to "figure out" the right splits to approximate the transformed feature without overfitting, given the huge space of possible interactions of multiple features it could try.
  • When the transformation is complex and not well approximated by step functions (especially if splits on several variables within the same tree are needed for a good approximation, i.e. strong high-dimensional interactions, which a step function in just one or two variables gets badly wrong).
  • When we have strong theoretical reasons (e.g. due to the geometry or physics of the situation) or human intuition to use a particular feature. For example, when predicting whether someone will buy something when visiting a website, given previous sales and previous visits per customer as features, it is very intuitive for a human to form a "sales per visit" feature (see the sketch after this list).
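A hypothetical sketch of the "sales per visit" idea from the last bullet, with made-up column names and values, just to show the engineered feature being constructed:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "previous_sales":  [0, 2, 5, 1],
    "previous_visits": [3, 4, 10, 1],
})

# Encode the human intuition directly as a ratio feature;
# guard against division by zero for customers with no recorded visits.
df["sales_per_visit"] = df["previous_sales"] / df["previous_visits"].replace(0, np.nan)
```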
Björn