
I have a high-dimensional, imbalanced, categorical tabular dataset. I applied a label encoder to convert the categorical values into numerical values, and I used mean imputation to fill in the missing values. I applied an oversampling technique to the training set and got a prediction recall of around 0.800 for logistic regression. With other classifiers, such as Naive Bayes and random forest, I did not get such high prediction accuracy.
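
For reference, here is roughly what this pipeline corresponds to in scikit-learn terms (I actually did everything in Weka, so the file name, column names, and the specific oversampler below are only illustrative):

```python
# Rough scikit-learn sketch of the pipeline described above.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from imblearn.over_sampling import RandomOverSampler  # illustrative choice

df = pd.read_csv("data.csv")                     # hypothetical file
X, y = df.drop(columns="target"), df["target"]   # assumed binary 0/1 target

# Label-encode the categorical features (one integer per category);
# missing values stay NaN (needs scikit-learn >= 1.1).
enc = OrdinalEncoder(encoded_missing_value=np.nan)
X = pd.DataFrame(enc.fit_transform(X), columns=X.columns)

# Mean-impute the missing values, as described above.
X = pd.DataFrame(SimpleImputer(strategy="mean").fit_transform(X),
                 columns=X.columns)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Oversample the minority class on the training set only.
X_res, y_res = RandomOverSampler(random_state=0).fit_resample(X_train, y_train)

model = LogisticRegression(max_iter=1000).fit(X_res, y_res)
print("test recall:", recall_score(y_test, model.predict(X_test)))
```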

I used Weka for the model training.

My question is: why did I get good accuracy with logistic regression but not with the other classifiers?

Thank you.

Encipher
  • Are you measuring your performance on the training data or a separate test set? – Dave Oct 29 '22 at 23:52
  • I am measuring my performance on test data. – Encipher Oct 30 '22 at 00:55
  • How does the performance compare on the training data? – Dave Oct 30 '22 at 00:57
  • I did not test the performance on the training dataset. In Weka, you can open the training dataset in the Preprocess tab, do the oversampling on the training dataset there, supply the test dataset in the Classify tab, and then run the classifier. You get the performance measures on the test dataset. – Encipher Oct 30 '22 at 01:01
  • 1
    "Unbalanced" data is not a problem, and oversampling will not address a non-problem. Recall suffers from the exact same problems as accuracy. I would recommend that you first rethink your evaluation measures and ideally go with probabilistic predictions, assessing these using proper scoring rules and possibly calibration diagrams. I do not think you can trust recall on an oversampled dataset to tell you whether one model is better than another one. – Stephan Kolassa Oct 30 '22 at 07:42
  • Can you please elaborate on why I cannot trust recall on resampled data? What types of metrics should I consider instead of recall? – Encipher Oct 31 '22 at 04:37
  • 1
    @Encipher Did you look at the linked material? // Recall is iffy for resampled data because you are giving the model an incorrect prior probability, so the posterior probabilities and derived quantities (like recall) are affected. – Dave Oct 31 '22 at 05:09
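
For reference, a minimal sketch of what these comments suggest: score the model's predicted probabilities with proper scoring rules (log loss and the Brier score) instead of recall, and compare the mean predicted probability to the true base rate to see the incorrect prior that oversampling introduces. The synthetic data and the choice of RandomOverSampler are assumptions for illustration, not the original Weka setup.

```python
# Sketch: proper scoring rules on held-out probabilities, with and
# without oversampling. Synthetic data; the numbers are only illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, brier_score_loss
from imblearn.over_sampling import RandomOverSampler  # assumed oversampler

X, y = make_classification(n_samples=5000, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# One model fit on the original training data, one on an oversampled copy.
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
X_os, y_os = RandomOverSampler(random_state=0).fit_resample(X_tr, y_tr)
over = LogisticRegression(max_iter=1000).fit(X_os, y_os)

for name, model in [("original", plain), ("oversampled", over)]:
    p = model.predict_proba(X_te)[:, 1]
    print(f"{name:>11}: log loss={log_loss(y_te, p):.3f}, "
          f"Brier={brier_score_loss(y_te, p):.3f}, "
          f"mean predicted prob={p.mean():.3f}, true rate={y_te.mean():.3f}")
```

The oversampled model's mean predicted probability lands far above the true positive rate, which is the distorted prior the comments describe; the proper scores let you compare models without that distortion.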

1 Answer

This is speculation, since your comments mention that you did not check the training performance, but poor out-of-sample performance from fancy models sounds like a classic case of overfitting.

Yes, a model like a random forest allows much more flexibility in the modeling than a vanilla logistic regression, but this runs the risk of fitting the model to coincidences in your data (noise) instead of genuine trends (signal). In the extreme, think about playing connect-the-dots.

When you go and apply this overfitted model to new data, the predictions have little to do with the true trend, and your predictive accuracy is poor.
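
As a quick diagnostic (this is what I was asking about in the comments), compare each model's performance on the training data with its performance on the test data: a large gap is the signature of overfitting. A minimal sketch on synthetic data, so the dataset and numbers below are placeholders, not your Weka setup:

```python
# Sketch: a large train/test gap is the classic signature of overfitting.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=50, n_informative=5,
                           weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(random_state=0)):
    model.fit(X_tr, y_tr)
    print(type(model).__name__,
          "train:", round(model.score(X_tr, y_tr), 3),
          "test:", round(model.score(X_te, y_te), 3))
# A random forest typically scores near 1.0 on its own training data while
# its test score lags behind; the logistic regression gap is much smaller.
```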

Balancing the ability to fit complicated trends while guarding against overfitting is really the key to doing good machine learning work.

Dave
  • Can you please elaborate on what you mean? What should my step-by-step approach be for evaluating the performance of models like logistic regression or random forest? – Encipher Oct 30 '22 at 01:18
  • What do you mean? You already have the performance. – Dave Oct 30 '22 at 01:20
  • I have doubts. To evaluate a model, do I first need to find the performance on the training set, and then supply the test set and find the performance on the test set? If I use only the training set for performance evaluation, do I need to use k-fold cross-validation? Also, with a pandas DataFrame we split the dataset with a train-test split, then train the model and predict on the test set (roughly the workflow sketched below). Am I right about this? – Encipher Oct 30 '22 at 01:30
  • Please post that as a separate question. – Dave Oct 30 '22 at 01:32
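
For reference, a minimal sketch of the standard workflow the comment above describes: split once, use k-fold cross-validation on the training portion for model selection, and touch the held-out test set only once at the end. Synthetic data and default settings; the details were deferred to the separate question.

```python
# Sketch: train/test split, k-fold CV on the training set, final test score.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000)
cv_scores = cross_val_score(model, X_tr, y_tr, cv=5)  # model selection step
print("5-fold CV accuracy on the training data:", round(cv_scores.mean(), 3))

model.fit(X_tr, y_tr)                                 # final fit
print("held-out test accuracy:", round(model.score(X_te, y_te), 3))
```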