6

I have a dataset where the predicted variable has two classes: true and false. 99.99% of the values belong to the false class, so the no-information rate is 99.99%. Any model that I build needs to have an accuracy higher than the no-information rate.

It is very difficult to beat the no-information rate. In such a case, will a model with an accuracy of 70-80% be of any value at all? If not, what are my options for improving the accuracy of my model? I have tried various techniques such as oversampling the minority class, undersampling the majority class, and SMOTE, but it is hard to beat the as-is accuracy.
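Here is a minimal sketch of the kind of resampling I tried, using the imbalanced-learn package on simulated data (the data and names here are purely illustrative, not my real dataset):

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

# Simulated imbalanced data standing in for the real dataset (about 1% positives
# here so that SMOTE has enough minority neighbours to work with).
X, y = make_classification(n_samples=50_000, weights=[0.99, 0.01], random_state=0)
print("original class counts:", Counter(y))

# Oversample the minority class by synthesising new minority examples (SMOTE)...
X_over, y_over = SMOTE(random_state=0).fit_resample(X, y)
print("after SMOTE:", Counter(y_over))

# ...or randomly undersample the majority class instead.
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
print("after undersampling:", Counter(y_under))
```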

add787

3 Answers

5

This is a strong argument for why you should never use a discontinuous improper accuracy scoring rule. It should also be a clue that any scoring rule that tempts you to remove data from the sample has to be bogus. If you were truly interested in all-or-nothing classification, you could just ignore all the data and predict that an observation is always in the majority class. Better would be to develop a probability model (e.g., logistic regression) and use a proper accuracy score to assess the model's value (the logarithmic probability scoring rule, i.e. deviance = log-likelihood = pseudo $R^2$ for this purpose, or the Brier score).
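A minimal sketch of this idea, assuming scikit-learn and simulated data in place of your own (none of the names here come from your dataset): fit a probability model and score its predicted probabilities with proper scoring rules rather than classification accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, log_loss
from sklearn.model_selection import train_test_split

# Simulated, heavily imbalanced data; substitute your own features and labels.
X, y = make_classification(n_samples=50_000, weights=[0.99, 0.01], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# A probability model rather than a hard classifier.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
p = model.predict_proba(X_test)[:, 1]

# Proper scoring rules evaluated on the predicted probabilities.
print("log loss (mean negative log-likelihood):", log_loss(y_test, p))
print("Brier score:", brier_score_loss(y_test, p))
```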

Frank Harrell
2

In highly skewed data sets, beating the default accuracy can be very difficult, and the ability to build a successful model may depend on how many positive examples you have and what the goals of your model are. Even with a very strong skew, building reasonable models is possible; as an example, the ipinyou data set has approximately 2.5 million negative examples and only a few thousand positive ones.

With a skewed dataset such as ipinyou, training using the AUC can help, as this looks at the area under the ROC curve, so predicting only one class doesn't improve the score. Another challenge with such datasets is their size, so ensuring you can actually process the data is important and may affect the language you use (Python, R, etc.), where the processing takes place (your computer or the cloud), and which algorithms you try. Linear methods may struggle with highly skewed data, whereas non-linear methods such as random forests or XGBoost can be much more effective.
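As a rough illustration (not part of your data or a definitive recipe), one way to train a non-linear model with AUC as the evaluation metric; this assumes a recent xgboost release and simulated data in place of ipinyou:

```python
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Simulated data with roughly 0.1% positives, loosely mimicking a strong skew.
X, y = make_classification(n_samples=100_000, weights=[0.999, 0.001], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Up-weight the rare positive class and monitor AUC, which is not fooled by the
# trivial "always predict negative" model that maximises raw accuracy.
ratio = (y_train == 0).sum() / (y_train == 1).sum()
clf = xgb.XGBClassifier(n_estimators=200, eval_metric="auc", scale_pos_weight=ratio)
clf.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)

print("test AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
```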

Careful feature engineering is also important, and sparse matrices and one-hot encoding may help you uncover the patterns within highly skewed data.
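For example, a small sketch (with invented column names) of producing a sparse one-hot encoded matrix with scikit-learn:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy categorical features; real datasets like ipinyou have far more levels.
df = pd.DataFrame({
    "site": ["a.com", "b.com", "a.com", "c.com"],
    "device": ["mobile", "desktop", "mobile", "mobile"],
})

# OneHotEncoder returns a scipy sparse matrix by default, which keeps memory
# manageable when the categorical features have many distinct values.
encoder = OneHotEncoder(handle_unknown="ignore")
X_sparse = encoder.fit_transform(df)
print(X_sparse.shape, repr(X_sparse))
```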

1

If I understand the question correctly, it should also be mentioned that a lot of the "standard" model statistics are meaningless on the test set, as you have probably applied the imbalance adjustment techniques only to the training set. In this case, as @Jonno Bourne pointed out, the AUC would be a better accuracy measure.
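A brief sketch of this point, using simulated data and illustrative names: split first, apply the imbalance adjustment only to the training fold, and report AUC on the untouched test fold.

```python
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=50_000, weights=[0.99, 0.01], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Resample only the training fold; the test fold keeps the original class balance.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)

model = LogisticRegression(max_iter=1000).fit(X_res, y_res)
print("test AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```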

Triamus