
I have a classification problem on a large dataset with many categorical variables of multiple levels, and neither RF nor XGBoost nor even deep learning gets above 60%–70% accuracy.

The response classes are quite unbalanced (one class accounts for only 7% of observations, while the other two each account for more than 40%). Even after I balance the classes with oversampling, the methods still perform poorly on the minority class, so I suspect the problem may lie in the original features I used.
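
To show what I mean by oversampling: below is a minimal sketch using imbalanced-learn's `RandomOverSampler` on toy data. My real data and code differ, so treat the column names and class sizes purely as an illustration.

```python
# Minimal sketch of how I balance the classes (random oversampling of the
# minority class); the DataFrame here is a toy stand-in for my real data.
from collections import Counter

import pandas as pd
from imblearn.over_sampling import RandomOverSampler

# toy data: two categorical predictors, one continuous, unbalanced 3-class target
df = pd.DataFrame({
    "cat_a": ["x", "y", "z", "x", "y", "x"] * 50,
    "cat_b": ["p", "q"] * 150,
    "num_1": range(300),
    "target": ["A"] * 130 + ["B"] * 140 + ["C"] * 30,   # class C ~ 10%
})

X = pd.get_dummies(df.drop(columns="target"))           # one-hot encode the categoricals
y = df["target"]

ros = RandomOverSampler(random_state=42)                 # duplicates minority rows until classes match
X_res, y_res = ros.fit_resample(X, y)

print("before:", Counter(y))
print("after: ", Counter(y_res))
```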

However, I haven't found a good way to do feature selection when so many of the predictors are categorical and only a few are continuous.
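
To make concrete the kind of feature selection I am asking about, here is a sketch of a simple filter-style score: the mutual information of each predictor with the class label, via scikit-learn's `mutual_info_classif`. The data and column names are placeholders, and whether this is a sound approach for mostly-categorical predictors is exactly part of my question.

```python
# Sketch of a filter-style relevance score for mixed categorical/continuous
# predictors: mutual information of each feature with the class label.
# (Toy data; I am not sure this is appropriate for my problem.)
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

df = pd.DataFrame({
    "cat_a": ["x", "y", "z", "x", "y", "x"] * 50,
    "cat_b": ["p", "q"] * 150,
    "num_1": np.arange(300, dtype=float),
    "target": ["A"] * 130 + ["B"] * 140 + ["C"] * 30,
})

X = df.drop(columns="target").copy()
cat_cols = X.select_dtypes(include="object").columns
X[cat_cols] = X[cat_cols].apply(lambda c: c.astype("category").cat.codes)  # integer-code categoricals

# flag which columns are discrete so the categorical MI estimator is used for them
discrete = np.array([col in cat_cols for col in X.columns])
scores = mutual_info_classif(X, df["target"], discrete_features=discrete, random_state=0)

print(pd.Series(scores, index=X.columns).sort_values(ascending=False))
```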

Can somebody offer advice on feature selection methods specific to categorical predictors (if such methods exist), or any other advice on improving the accuracy? Thanks!

EmLp
    Why do you believe that an accuracy better than 60%-70% is possible for your problem? – Matthew Drury May 09 '18 at 16:20
  • Actually I'm not very sure about that; maybe it's because people talk about the excellent performance of XGBoost and deep learning, yet I got even worse accuracy from those two methods than from RF (81% at most for RF and only 60%–70% for XGBoost and deep learning), which makes me think it may be due to the hyperparameter tuning of the two algorithms. I'm not an expert in the parameters of XGBoost or deep learning; I'm trying different combinations, but so far there is no obvious improvement in predictive performance... – EmLp May 09 '18 at 17:06
  • (Especially for the class with 7% of the observations: its sensitivity is only around 30%, while for a class with 40% of the observations sensitivity can be 83%.) One post (https://shiring.github.io/machine_learning/2017/03/07/grid_search) suggests that "hyper-parameter tuning can only improve the model so much without overfitting. If you can't achieve sufficient accuracy, the input features might simply not be adequate for the predictions you are trying to model." Thus, I think it may be due to the original features I used (or I haven't found the appropriate combination of parameters for the ML algorithms). – EmLp May 09 '18 at 17:06
  • When you talk about performance, are you talking test set or train set? Take train set performance as an upper bound for test set performance. Also, what are the dimensions of your data? – Jim May 09 '18 at 17:58
  • Hi, Jim, I mean test set performance. The response has 3 classes which are unbalanced, and there are around 20 predictor variables (3 continuous and the others are categorical). – EmLp May 09 '18 at 18:02
  • 1) Please edit your question to include the train set performance (aka in-sample performance). 2) Dimensions: how many cases (aka observations)? How many variables (aka features)? 3) Do you think the predictors (aka regressors) should reasonably be able to predict the outcome? 4) What is the test set performance per outcome class? This info will help a lot in answering your question. P.S. Please include a tag (like @EmLp); it pings me that there is a reply. – Jim May 10 '18 at 11:41
  • I would review https://stats.stackexchange.com/questions/222179/how-to-know-that-your-machine-learning-problem-is-hopeless and consider whether it is possible to derive more informative features, or collect more/higher quality data, or use a model that is more appropriate for your specific task. – Sycorax Jul 03 '18 at 16:13