
Suppose you have data with a bunch of predictors, and some of these predictors are proportions that add up to one. An example would be data like the following:

gender  perc_shop  perc_game  perc_stud  age
M       .23        .71        .06        31
F       .47        0          .53        19
F       .05        .31        .64        29

The variables in columns 2-4 all add up to one, so in logistic regression it would be necessary to remove one as a baseline variable. However, in building a classification model using machine learning methods (e.g. decision trees, random forests, SVMs), would it be necessary to remove one of the variables?

Megan

1 Answer


Compositional data have the property that the components of each observation sum to 1. In a regression setting (e.g. OLS, logistic regression), this causes problems: when the design matrix includes an intercept, it is not full rank, so the model is not identified and there are arbitrarily many solutions of equal quality. Dropping the intercept, dropping one of the proportion columns, or using a penalized method such as ridge regression are common remedies.
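
A quick way to see the rank deficiency, using the question's rows (plus one made-up row so there are more rows than columns) as a toy design matrix with an intercept; NumPy is assumed:

```python
import numpy as np

# Intercept column followed by the three proportion columns from the question
# (plus one made-up row).
X = np.array([
    [1.0, 0.23, 0.71, 0.06],
    [1.0, 0.47, 0.00, 0.53],
    [1.0, 0.05, 0.31, 0.64],
    [1.0, 0.10, 0.20, 0.70],
])

# The proportion columns add up to the intercept column, so only 3 of the
# 4 columns are linearly independent.
print(np.linalg.matrix_rank(X))  # 3
```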

None of the listed methods (decision trees, random forests, SVM) requires a full-rank design matrix in order to be fit. SVMs act on the kernel matrix, while decision trees simply seek out good splits.
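
As an illustration, here is a minimal sketch (simulated data, scikit-learn assumed) of a random forest fit with all three proportion columns retained; nothing needs to be dropped for the model to be estimated:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 200

# Simulated compositions (rows sum to 1) standing in for perc_shop, perc_game, perc_stud.
props = rng.dirichlet([1.0, 1.0, 1.0], size=n)
age = rng.integers(18, 65, size=(n, 1))
X = np.hstack([props, age])

# Toy label, just so the fit has something to predict.
y = (props[:, 0] > 0.4).astype(int)

# All columns are kept; the linear dependence among the proportions is not a problem here.
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.score(X, y))
```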

For tree-based models, dropping a redundant column can have side effects, such as pushing the model toward deeper trees. This answer discusses an example in the context of gradient-boosted trees: One hot encoding of a binary feature when using XGBoost

It can be desirable to transform the compositional data to a new representation. Here's a discussion: How to perform isometric log-ratio transformation
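
For concreteness, here is a minimal NumPy sketch of the isometric log-ratio (ILR) transform; the Helmert-style basis and the zero-replacement step are illustrative choices, not the only ones (note the question's data contain a zero, and log-ratio transforms require strictly positive parts):

```python
import numpy as np

def ilr(x, eps=1e-6):
    """Isometric log-ratio transform of compositions (rows sum to 1)."""
    x = np.asarray(x, dtype=float)
    # Replace zeros with a small pseudo-count and renormalize (one simple convention).
    x = np.where(x <= 0, eps, x)
    x = x / x.sum(axis=1, keepdims=True)

    # Centered log-ratio (clr): log parts minus their per-row mean.
    logx = np.log(x)
    clr = logx - logx.mean(axis=1, keepdims=True)

    # Helmert-style orthonormal basis of the (D-1)-dimensional subspace
    # orthogonal to the vector of ones.
    D = x.shape[1]
    H = np.zeros((D - 1, D))
    for i in range(D - 1):
        H[i, : i + 1] = 1.0 / (i + 1)
        H[i, i + 1] = -1.0
        H[i] *= np.sqrt((i + 1) / (i + 2))

    return clr @ H.T

# The three proportion columns from the question map to two unconstrained features.
props = np.array([[0.23, 0.71, 0.06],
                  [0.47, 0.00, 0.53],
                  [0.05, 0.31, 0.64]])
print(ilr(props))  # shape (3, 2)
```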

Sycorax