I have an 8-class classification problem in which the classes are extremely imbalanced. The input dataset consists of sequences, each with n features, where n = 19. For each of the 8 classes, I have prior knowledge of which subset of the n features significantly affects the correct prediction of that class. A given feature can be significant for some classes and insignificant for others. So for each class there is a set of k features that significantly affect the correct prediction of that class, where 1 ≤ k ≤ n. For example, the features in the right half, including the middle feature, positively affect the correct prediction of class number 5, whereas the features on the left side, excluding the middle, negatively affect it.
The objective of my experiment is to show that some features are more significant than others for predicting a specific class. To do that, I turned the multiclass problem into a binary one: class number 5 is relabeled '1' (the positive class) and the remaining 7 classes are relabeled '0' (the negative class). Then I trained the neural network twice. (1) To show that the right-half features, including the middle feature, positively affect the correct prediction of class 1, I masked/cleared all the left-side features, excluding the middle, in all input sequences. I then trained the NN and got high accuracy for the predictions on class 1 (overall accuracy: 0.90641, with accuracy of 0.93612 and 0.83985 for class 0 and class 1 respectively). (2) To show that the left-side features, excluding the middle, negatively affect the correct prediction of class 1, I masked/cleared all the right-half features, including the middle feature, in all input sequences. I then trained the NN and got very low accuracy for the predictions on class 1 (overall accuracy: 0.79095, with accuracy of 0.90885 and 0.52678 for class 0 and class 1 respectively).
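To make the two setups concrete, here is a minimal sketch of the relabeling and masking steps. It assumes the sequences are stored as a NumPy array of shape (num_samples, 19), that "the middle feature" is the one at 0-based index 9, and that masking means setting a feature to 0. The helper names `mask_features` and `binarize_labels` and the demo data are hypothetical, not part of my actual pipeline.

```python
import numpy as np

N_FEATURES = 19
MIDDLE = N_FEATURES // 2  # 0-based index 9, assumed to be "the middle feature"

def mask_features(X, keep):
    """Return a copy of X in which every column outside `keep` is set to 0."""
    X = np.asarray(X, dtype=float)
    masked = np.zeros_like(X)
    masked[:, keep] = X[:, keep]
    return masked

def binarize_labels(y, positive_class=5):
    """Relabel `positive_class` as 1 and every other class as 0."""
    return (np.asarray(y) == positive_class).astype(int)

# Feature subsets used in the two experiments
right_incl_middle = np.arange(MIDDLE, N_FEATURES)  # indices 9..18
left_excl_middle = np.arange(0, MIDDLE)            # indices 0..8

# Hypothetical demo data, only to show the shapes involved
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(4, N_FEATURES))
y_demo = np.array([5, 2, 5, 7])

X_exp1 = mask_features(X_demo, right_incl_middle)  # experiment (1): left side cleared
X_exp2 = mask_features(X_demo, left_excl_middle)   # experiment (2): right side cleared
y_bin = binarize_labels(y_demo)
```

Note that the same mask is applied to every sequence regardless of its true label, which is exactly what raises the question about class 0 below.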
Since the dataset is imbalanced, accuracy isn’t an appropriate measure of classifier performance; the usual choice would be the F1-score. But I have doubts about choosing F1, because I masked/cleared either all the left-side features (excluding the middle) or all the right-half features (including the middle) in every input sequence, whether its correct label is class 1 or not. The masked features are known to affect the prediction of class 1, positively or negatively, but there is no guarantee about their effect on class 0.
My first question: what is the best metric for this problem? Should it be F1 or recall?
My second question: should the chosen metric be computed (1) over the whole confusion matrix as an averaged score (i.e., recall score = (class 0 recall + class 1 recall)/2, or F1 score = (class 0 F1 + class 1 F1)/2), or (2) for the target class 1 only? In the second case, if the appropriate metric is recall, the recall of class 1 is computed from the second row of the confusion matrix only (FN and TP); if it is F1, the F1 of class 1 is computed from FN, TP and FP only.
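The two options can be written out as a short sketch. It takes the four cells of a binary confusion matrix (rows are true labels, columns are predictions) and returns both the class-1-only scores and the two-class averages. The function name `binary_metrics` and the counts in the example are hypothetical, and the sketch assumes both classes have at least one sample so that no denominator is zero.

```python
def binary_metrics(tn, fp, fn, tp):
    """Recall and F1 for each class, plus their unweighted two-class averages."""
    recall_1 = tp / (tp + fn)            # second row of the matrix: FN and TP
    recall_0 = tn / (tn + fp)            # first row of the matrix: TN and FP
    f1_1 = 2 * tp / (2 * tp + fp + fn)   # uses TP, FP and FN only (no TN)
    f1_0 = 2 * tn / (2 * tn + fn + fp)   # class 0 treated as the "positive" class
    return {
        "recall_1": recall_1,
        "recall_0": recall_0,
        "f1_1": f1_1,
        "f1_0": f1_0,
        "avg_recall": (recall_0 + recall_1) / 2,  # option (1), recall version
        "avg_f1": (f1_0 + f1_1) / 2,              # option (1), F1 version
    }

# Hypothetical counts: 100 true class-0 samples, 20 true class-1 samples
scores = binary_metrics(tn=90, fp=10, fn=5, tp=15)
print(scores)
```

If the per-class accuracies I reported above are the fractions of each class's samples that were predicted correctly, they are per-class recalls, and the averaged recall of option (1) would be (0.93612 + 0.83985)/2 ≈ 0.888 for experiment (1) and (0.90885 + 0.52678)/2 ≈ 0.718 for experiment (2).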
My third question: if an F1-score over the whole confusion matrix is the appropriate metric, should it be macro-F1 (i.e., equal weight for each class) or weighted-F1 (i.e., each class's F1-score weighted by the number of samples in that class)?
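To show how the two averages can differ under imbalance, here is a minimal sketch that computes both from a confusion matrix given as a list of rows (rows are true labels, columns are predictions). The function names and the example matrix are hypothetical; a class with no true or predicted samples is given an F1 of 0 here, which is an assumption of this sketch.

```python
def f1_per_class(cm):
    """Return (per-class F1 scores, per-class supports) for a square confusion matrix."""
    n = len(cm)
    scores, supports = [], []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                        # rest of row c
        fp = sum(cm[r][c] for r in range(n)) - tp   # rest of column c
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
        supports.append(tp + fn)                    # number of true samples of class c
    return scores, supports

def macro_f1(cm):
    """Unweighted mean of the per-class F1 scores."""
    scores, _ = f1_per_class(cm)
    return sum(scores) / len(scores)

def weighted_f1(cm):
    """Mean of the per-class F1 scores, weighted by class support."""
    scores, supports = f1_per_class(cm)
    return sum(s * w for s, w in zip(scores, supports)) / sum(supports)

# Hypothetical imbalanced example: 100 samples of class 0, 20 of class 1
cm_demo = [[90, 10],
           [5, 15]]
print(round(macro_f1(cm_demo), 4), round(weighted_f1(cm_demo), 4))
```

With these hypothetical counts the weighted average is pulled toward the majority class's F1, while the macro average gives the minority class 1 the same say as class 0.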
The reason for my confusion is that I gave class 1 all the features it needs to be correctly classified (or misclassified), but the masked features may also significantly affect the correct prediction of class 0. So I think the process I followed is intentionally biased toward measuring class 1 and doesn’t care about the prediction of class 0. In other words, the features that help class 0 to be predicted correctly may be completely or partially absent because of the class 1 mask I applied. Therefore, different prediction errors have different implications: predicting class 1 as 0 is likely to have a different cost than predicting class 0 as 1, which in turn may affect which metric is appropriate for selecting the best-performing epoch for this problem.