As a newbie, I am having a tough time understanding the drop argument in OneHotEncoder. Does it drop the original column holding the non-numerical values of a categorical variable once the encoding is done?

I am looking at the documentation and I don't get how we end up with a 2 x 3 array at the end.

>>> enc = OneHotEncoder(handle_unknown='ignore')
>>> X = [['Male', 1], ['Female', 3], ['Female', 2]]

...

>>> drop_enc = OneHotEncoder(drop='first').fit(X)
>>> drop_enc.categories_
[array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]
>>> drop_enc.transform([['Female', 1], ['Male', 2]]).toarray()
array([[0., 0., 0.],
       [1., 1., 0.]])

I was expecting 5 columns:

  • 1 column for Male,
  • 1 column for Female,
  • 1 column for category 1,
  • 1 column for category 2,
  • 1 column for category 3.

1 Answer

Implementation-wise, drop='first' results in the first category of each input column being dropped. Thus, out of ['Female', 'Male'], Female is dropped, and out of [1, 2, 3], 1 is dropped. This leads to the format that you've printed, but for clarity, let's write out some column names.

Suppose your original features are called "biological sex," which (for this example) is ['Female', 'Male'], and "ID", which is [1, 2, 3]. Your output is equivalent to the table:

bio. sex is Male | ID is 2 | ID is 3
-----------------|---------|--------
0                | 0       | 0
1                | 1       | 0

which corresponds to ['Female', 1] and ['Male', 2] as expected: a row of all zeros means "first category" for every feature. Note that the columns "bio. sex is Female" and "ID is 1" are omitted by design.

To get the 5-column output you expect, simply remove the drop keyword.
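
As a quick check of that claim, here is a small sketch using the same data as the question, with the encoder built without any drop argument (the variable names full_enc and full are mine):

```python
from sklearn.preprocessing import OneHotEncoder

X = [['Male', 1], ['Female', 3], ['Female', 2]]

# No drop argument: every category of every feature keeps its own column.
full_enc = OneHotEncoder().fit(X)
full = full_enc.transform([['Female', 1], ['Male', 2]]).toarray()

print(full.shape)  # 2 categories + 3 categories = 5 columns
print(full)
```

The columns follow the order of categories_, so they read Female, Male, 1, 2, 3.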

There are many reasons why one would or would not choose to drop one category per (original) column. In short, if every category of a column is one-hot encoded, the encoded columns for that column sum to 1 in every row, so they are multicollinear (i.e., one of them is a linear combination of the others and the intercept), which may or may not result in ill-defined estimators (e.g., non-regularized OLS). I am glossing over an incredible amount of nuance; I would carefully read the linked answer to learn more.
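
To make the redundancy concrete, here is a plain-Python sketch that does not use scikit-learn at all. The one_hot helper is hypothetical and only mimics a full (no drop) encoding of the rows from the question.

```python
def one_hot(value, categories):
    """Full one-hot encoding of `value`; a hypothetical helper, not scikit-learn."""
    return [1 if value == c else 0 for c in categories]

sexes = ['Female', 'Male']
ids = [1, 2, 3]
rows = [['Male', 1], ['Female', 3], ['Female', 2]]

block_sums = []
recovered = []
for sex, id_ in rows:
    sex_block = one_hot(sex, sexes)  # [is Female, is Male]
    id_block = one_hot(id_, ids)     # [is 1, is 2, is 3]
    # Each block sums to 1 in every row: an exact linear dependency.
    block_sums.append((sum(sex_block), sum(id_block)))
    # The first column of each block can be rebuilt from the remaining ones,
    # so dropping it loses no information.
    recovered.append(
        sex_block[0] == 1 - sum(sex_block[1:])
        and id_block[0] == 1 - sum(id_block[1:])
    )

print(block_sums)
print(all(recovered))
```

Every block sums to 1, and the first column of each block is fully determined by the others. That is the dependency drop='first' removes, and it is also why the dropped encoding still distinguishes every category.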