Dropping one of the columns when using one-hot encoding

Question

My understanding is that in machine learning it can be a problem if your dataset has highly correlated features, as they effectively encode the same information.

Recently someone pointed out that when you do one-hot encoding on a categorical variable you end up with correlated features, so you should drop one of them as a "reference".

For example, encoding gender as two variables, is_male and is_female, produces two features which are perfectly negatively correlated, so they suggested just using one of them, effectively setting the baseline to say male, and then seeing if the is_female column is important in the predictive algorithm.

That made sense to me but I haven't found anything online to suggest this may be the case, so is this wrong or am I missing something?

Possible (unanswered) duplicate: Does collinearity of one-hot encoded features matter for SVM and LogReg?

you end up with correlated features, so you should drop one of them as a "reference" Dummy variables or indicator variables (these are the two names used in statistics, synonymic to "one-hot encoding" in machine learning) are correlated pairwisely anyway, be they all k or k-1 variables. So, the better word is "statistically/informationally redundant" instead of "correlated". — ttnphns, Aug 23 '16 at 14:09
The set of all k dummies is the multicollinear set because if you know values of k-1 dummies in the data you automatically know the values of that last one dummy. Some data analysis methods or algorithms require that you drop one of the k. Other are able to cope with all k. — ttnphns, Aug 23 '16 at 14:09
@ttnphns: thanks, that makes sense. Does keeping all k values theoretically make them weaker features that could/should be eliminated with dimensionality reduction? One of the arguments for using something like PCA is often to remove correlated/redundant features, I'm wondering if keeping all k variables falls in that category. — dasboth, Aug 23 '16 at 14:24
Does keeping all k values theoretically make them weaker features. No (though I'm not 100% sure what you mean by "weaker"). using something like PCA Note, just in case, that PCA on a set of dummies representing one same categorical variable has little practical point because the correlations inside the set of dummies reflect merely the relationships among the category frequencies (so if all frequencies are equal all the correlations are equal to 1/(k-1)). — ttnphns, Aug 23 '16 at 15:21
What I mean is when you use your model to evaluate feature importance (e.g. with a random forest) will it underestimate the importance of that variable if you include all k values? As in, do you get a "truer" estimate of the importance of gender if you're only using an is_male variable as opposed to both options? Maybe that doesn't make sense in this context, and it might only be an issue when you have two different variables actually encoding the same information (e.g. height in inches and height in cm). — dasboth, Aug 23 '16 at 15:31
Classification trees / random forests, as I'm aware, can take categorical predictors directly and handle them properly. You don't have to one-hot recode them, do you? As in, do you get a "truer" estimate of the importance of gender No, not truer. Same. — ttnphns, Aug 23 '16 at 16:12
That's a good point about the trees. Thanks for your answers. My final question: Some data analysis methods or algorithms require that you drop one of the k - which ones in particular? — dasboth, Aug 24 '16 at 08:48
which ones in particular? The immediate one example - linear regression. If you want to predict Y by categorical factor(s), such as gender and race, and you don't have or don't want to use a specialized program such as ANOVA which recognizes categorical factors, - you will need to input the predictors in linear regression as the sets of dummy variables. And, because each set is collinear inside (which fact regression won't tolerate) you'll have to drop any one dummy from every set. — ttnphns, Aug 24 '16 at 10:50
you can still keep all one-hot dummies, but drop the intercept instead — rep_ho, Feb 05 '20 at 22:31
Most of the answers here address a regression setting, but the particular behavior of dummies can depend on the model in question. The question has the [tag:machine-learning] tag, so it's worth addressing models beyond regression. An example of how categorical features can behave in gradient-boosted methods like xgboost: https://stats.stackexchange.com/questions/438875/one-hot-encoding-of-a-binary-feature-when-using-xgboost/439191#439191 — Sycorax, Feb 05 '20 at 22:51
This website provides explanations with examples when the additional columns should be dropped or not https://inmachineswetrust.com/posts/drop-first-columns/ — LifeJadid, Jun 24 '21 at 08:58
I have performed OHE on a copy of df for my Lasso plots, KDE plots, Matrix plots, Heatmap plots. The dataframe copy which I used for these plots is called df1. Now prior to modeling I'm selecting X,y duringtrain_test_split. Should X,y be derived from df or df1? What about the columns that got dropped as reference columns during OHE? What's the convention in this kind of case? How can I make it more manageable? — Edison, Jul 02 '22 at 11:51

score 50 · Answer 1 · edited Nov 10 '22 at 16:41

This depends on the models (and maybe even software) you want to use. With linear regression, or generalized linear models estimated by maximum likelihood (or least squares) (in R this means using functions lm or glm), you need to leave out one column. Otherwise you will get a message about some columns "left out because of singularities"$^\dagger$.

But if you estimate such models with regularization, for example ridge, lasso er the elastic net, then you should not leave out any columns. The regularization takes care of the singularities, and more important, the prediction obtained may depend on which columns you leave out. That will not happen when you do not use regularization$^\ddagger$. See the answer at How to interpret coefficients of a multinomial elastic net (glmnet) regression which supports this view (with a direct quote from one of the authors of glmnet).

With other models, use the same principles. If the predictions obtained depends on which columns you leave out, then do not do it. Otherwise it is fine.

So far, this answer only mentions linear (and some mildly non-linear) models. But what about very non-linear models, like trees and randomforests? The ideas about categorical encoding, like one-hot, stems mainly from linear models and extensions. There is little reason to think that ideas derived from that context should apply without modification for trees and forests! For some ideas see Random Forest Regression with sparse data in Python.

$^\dagger$ But, using factor variables, R will take care of that for you.

$^\ddagger$ Trying to answer extra question in comment: When using regularization, most often iterative methods are used (as with lasso or elasticnet) which do not need matrix inversion, so that the design matrix do not have full rank is not a problem. With ridge regularization, matrix inversion may be used, but in that case the regularization term added to the matrix before inversion makes it invertible. That is a technical reason, a more profound reason is that removing one column changes the optimization problem, it changes the meaning of the parameters, and it will actually lead to different optimal solutions. As a concrete example, say you have a categorical variable with three levels, 1,2 and 3. The corresponding parameters is $\beta_, \beta_2, \beta_3$. Leaving out column 1 leads to $\beta_1=0$, while the other two parameters change meaning to $\beta_2-\beta_1, \beta_3-\beta_1$. So those two differences will be shrunk. If you leave out another column, other contrasts in the original parameters will be shrunk. So this changes the criterion function being optimized, and there is no reason to expect equivalent solutions! If this is not clear enough, I can add a simulated example (but not today).

I can agree that choice of reference variable will affect the outcome of regularized regression, but I am not very sure if leaving all variables as-is is better than dropping one. Do you have some reason for that? — Kota Mori, Jan 09 '19 at 23:38
Well, only two alternatives ... Keeping all the levels keep the invariance and there is no arbitrrariness. If you have other reaons ro want to reduce the number of levels, such as too many of them, you should tell us about that — kjetil b halvorsen, Jan 09 '19 at 23:47
Can you please add the simulation and also show point to some reading links on how iterative method in L1 and L2 does matrix inversion. — MAC, Jul 02 '20 at 03:19
I totally agree with your point about interpretation change, which gets more complicated when including interactions. Sorry but IMHO this is a statement without clear evidence (i.e. an opinion). Please refer to an exhaustive study/peer reviewed paper (using several benchmark datasets) to prove your point regarding performance deterioration. — cs0815, Aug 26 '20 at 06:44

score 4 · Answer 2 · edited Jun 11 '20 at 14:32

In chapter 5 of this book Feature engineering for machine learning has an example can illustrate kjetil's answer.

City Rent
0 SF 3999
1 SF 4000
2 SF 4001
3 NYC 3499
4 NYC 3500
5 NYC 3501
6 Seattle 2499
7 Seattle 2500
8 Seattle 2501

One-hot encoding:

San Francisco      1     0     0
New York              0     1     0
Seattle                  0     0     1

Dummy encoding (drop one column):

San Francisco      1     0     0
~~New York              0     1     0~~
Seattle                  0     0     1
Results:
                                    NYC       SF       SE       b
One-hot encoding 166.67 666.67 –833.33 3333.33
Dummy coding             0        500   –1000     3500

I was not aware one-hot was different than dummy encoding, I thought they were the same! Thanks for pointing this out. Some Googling lead me to this similar question which covers the same topic https://stats.stackexchange.com/questions/224051/one-hot-vs-dummy-encoding-in-scikit-learn — Espen Riskedal, May 19 '21 at 16:12

Dropping one of the columns when using one-hot encoding

2 Answers2

Linked

Related