I am building binary classification models to fit my data and make predictions. Some of the variables have many levels, like the workclass variable below, so I am considering combining some levels into one. For example, I could combine "Federal-gov", "Local-gov", and "State-gov" into one level called "gov". However, I have two questions.

  1. Under what circumstances should I combine levels? How can I tell whether doing so is likely to improve my model?
  2. If I think combining will be useful, how can I tell whether some levels are similar enough to be combined? Should I run any tests? How do I check the similarity?

Thanks.

> summary(census$workclass)
     Federal-gov        Local-gov     Never-worked          Private     Self-emp-inc Self-emp-not-inc 
             960             2093                7            22696             1116             2541 
       State-gov      Without-pay             NA's 
            1298               14             1836 
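
To make the example concrete, the kind of merge I have in mind could be sketched as follows. This uses a small made-up factor (the name `workclass` and its five values are placeholders, not the real `census` data) and relies on base R's behaviour that assigning the same name to several levels merges them.

```r
# Toy stand-in for census$workclass (hypothetical values, not the real data)
workclass <- factor(c("Federal-gov", "Local-gov", "State-gov",
                      "Private", "Self-emp-inc"))

# The three government levels to be collapsed into one
gov_levels <- c("Federal-gov", "Local-gov", "State-gov")

# Renaming several levels to the same label merges them into a single level
levels(workclass)[levels(workclass) %in% gov_levels] <- "gov"

# Inspect the counts per level after merging
table(workclass)
```

This only shows the mechanics of collapsing the levels; whether the merge is justified is exactly what the two questions above are asking.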
Evan Liu