
Consider a binary classification problem. The data set contains $1000$ samples: $500$ labeled positive and $500$ labeled negative. Positive samples have the label $1$ and negative samples have the label $-1$.

However, the quality of this data set is very low. Many of the samples labeled positive are mislabeled, while the samples labeled negative are all correctly labeled.

I use $80\%$ of the samples in this data set for training and the remaining $20\%$ for testing. With a soft-margin SVM, the final test accuracy is only about $80\%$.
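To make the setup concrete, here is a minimal sketch of it on synthetic data, not my real data set. Everything specific in it is an assumption made for illustration: the two Gaussian blobs, the noise rate of $0.4$, the helper names (`make_noisy_dataset`, `train_linear_svm`), and the use of a Pegasos-style stochastic subgradient solver for a linear soft-margin SVM in place of whatever SVM implementation one would normally use. The bias is handled by appending a constant feature, which is a simplification because it gets regularized too.

```python
import random

def make_noisy_dataset(n_per_label=500, noise_rate=0.4, seed=0):
    """Return (features, observed_label, true_label) triples.

    Assumed noise model: every sample labeled -1 is truly negative, while a
    fraction `noise_rate` of the samples labeled +1 is truly negative.
    """
    rng = random.Random(seed)

    def draw(true_label):
        # Assumed toy geometry: blobs centred at (+2, +2) and (-2, -2).
        centre = 2.0 * true_label
        return (rng.gauss(centre, 1.0), rng.gauss(centre, 1.0))

    data = [(draw(-1), -1, -1) for _ in range(n_per_label)]  # clean negatives
    n_wrong = round(noise_rate * n_per_label)
    for i in range(n_per_label):
        true_label = -1 if i < n_wrong else 1  # first n_wrong are mislabeled
        data.append((draw(true_label), 1, true_label))
    rng.shuffle(data)
    return data

def train_linear_svm(samples, lam=0.01, epochs=20, seed=0):
    """Pegasos-style SGD on lam/2 * ||w||^2 + mean hinge loss."""
    rng = random.Random(seed)
    w = [0.0, 0.0, 0.0]  # last weight acts as the bias
    t = 0
    for _ in range(epochs):
        order = list(range(len(samples)))
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)
            x, y = samples[i]
            xa = (x[0], x[1], 1.0)  # append constant feature for the bias
            margin = y * sum(wj * xj for wj, xj in zip(w, xa))
            w = [(1.0 - eta * lam) * wj for wj in w]  # regularization shrink
            if margin < 1.0:  # hinge loss is active for this sample
                w = [wj + eta * y * xj for wj, xj in zip(w, xa)]
    return w

def predict(w, x):
    score = w[0] * x[0] + w[1] * x[1] + w[2]
    return 1 if score >= 0 else -1

def accuracy(w, pairs):
    return sum(predict(w, x) == y for x, y in pairs) / len(pairs)

data = make_noisy_dataset()
split = round(0.8 * len(data))  # 80% train, 20% test
train, test = data[:split], data[split:]

# Train on the observed (noisy) labels only, as in the real setting.
w = train_linear_svm([(x, obs) for x, obs, _ in train])

acc_observed = accuracy(w, [(x, obs) for x, obs, _ in test])
acc_true = accuracy(w, [(x, true) for x, _, true in test])
print(f"accuracy vs. observed labels: {acc_observed:.3f}")
print(f"accuracy vs. true labels:     {acc_true:.3f}")
```

In a simulation like this the true labels are known, so the accuracy against the observed labels can be compared with the accuracy against the true ones. In my real data only the observed labels are available, so the reported $80\%$ is measured against labels that are themselves partly wrong.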

So on such a low-quality data set, a model trained with a standard supervised learning method does not reach high accuracy on the test set. I have the following questions:

  • If I want to improve accuracy, is semi-supervised learning the best approach? Is there any other way?
  • Which semi-supervised learning method should I use: S3VM, S4VM, or something else?
  • If using semi-supervised learning, how can I manipulate the existing data set?

  • Another avenue to consider is re-doing the labeling on the 500 positive samples. – dipetkov Mar 30 '22 at 18:28
  • @dipetkov For some particular reason I can't relabel the samples, so what method should I use? – 3029 serity Mar 31 '22 at 06:53
  • I don't have advice about complex methods to apply on very low quality data. I'll be wondering whether it is time well spent. – dipetkov Mar 31 '22 at 07:21
  • A post that's often suggested in response to a question about ML methods producing poor results is this one. I don't imply your problem falls in the "hopeless" category. – dipetkov Mar 31 '22 at 07:27
