Consider a binary classification problem with a data set of $1000$ samples: $500$ labeled positive and $500$ labeled negative. Positive samples have the label $1$ and negative samples have the label $-1$.
However, the quality of this data set is very low. Many of the samples labeled positive are incorrectly labeled, while the samples labeled negative are all correctly labeled.
I use $80\%$ of the samples in this data set for training and the remaining $20\%$ for testing. With a soft-margin SVM, the accuracy on the test set is only about $80\%$.
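For concreteness, the setup can be sketched roughly as follows. This is an illustration rather than the actual pipeline: the two-blob synthetic features, the hypothetical count `N_WRONG = 150` of mislabeled samples, the linear kernel, and `C=1.0` are all assumptions made for the example. Only the $500$/$500$ label split, the $80\%$/$20\%$ split, and the soft-margin SVM come from the description above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical: how many of the 500 samples labeled +1 are actually negative.
N_WRONG = 150

# Synthetic stand-in for the real features: two Gaussian blobs in 2-D.
n_pos_true = 500 - N_WRONG  # samples that are truly positive
n_neg_true = 500 + N_WRONG  # samples that are truly negative
X = np.vstack([
    rng.normal(loc=1.5, scale=1.0, size=(n_pos_true, 2)),
    rng.normal(loc=-1.5, scale=1.0, size=(n_neg_true, 2)),
])
y_true = np.concatenate([
    np.ones(n_pos_true, dtype=int),
    -np.ones(n_neg_true, dtype=int),
])

# Observed labels: N_WRONG true negatives are wrongly labeled +1,
# so the label -1 is always correct but the label +1 is not.
y_noisy = y_true.copy()
neg_idx = np.flatnonzero(y_true == -1)
flipped = rng.choice(neg_idx, size=N_WRONG, replace=False)
y_noisy[flipped] = 1

# 80% training, 20% testing, both drawn from the noisy labels.
X_train, X_test, y_train, y_test = train_test_split(
    X, y_noisy, test_size=0.2, stratify=y_noisy, random_state=0
)

# Soft-margin SVM; C controls the penalty on margin violations.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X_train, y_train)

test_accuracy = clf.score(X_test, y_test)
print(f"accuracy on the (noisy) test labels: {test_accuracy:.3f}")
```

In this sketch the test labels come from the same noisy data set, so the printed accuracy is measured against labels that are themselves partly wrong.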
So for such a low-quality data set, a model trained with ordinary supervised learning does not reach high accuracy on the test set. I have the following questions:
- If I want to improve accuracy, is semi-supervised learning the best approach? Are there other ways?
- Which semi-supervised learning method should I use: S3VM, S4VM, or something else?
- If I use semi-supervised learning, how should I manipulate the existing data set?