
I have been doing a little research on the various cross-validation methods, but one issue remains a doubt for me in the Monte Carlo method. Let's suppose I have 2,000 data points: 500 with target 1 and the other 1,500 with target 0. What is the correct procedure for building balanced train-test datasets?

  1. Randomly choose 500 negative cases once, and run every iteration of the cross-validation over that fixed subset (500 with target 1, 500 with target 0).
  2. Randomly choose 500 negative cases in each iteration, and then do the random train-test split.

I think the second option is the best, because it is not limited to a single initial sample like option 1. Is there any literature that addresses this issue? And how do other cross-validation methods deal with it, for instance k-fold, where the folds are fixed over an initial sample?
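To make the two procedures more concrete, here is roughly what I mean in Python. The data, the 0.3 test fraction, and the number of iterations are just placeholders for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy data matching the question: 500 positives, 1,500 negatives (placeholder features)
y = np.array([1] * 500 + [0] * 1500)
X = rng.normal(size=(2000, 5))

pos_idx = np.where(y == 1)[0]
neg_idx = np.where(y == 0)[0]
n_iterations = 20  # number of Monte Carlo repetitions (arbitrary choice)

# Option 1: fix one balanced subset once, then run every CV iteration on it
neg_fixed = rng.choice(neg_idx, size=len(pos_idx), replace=False)
subset = np.concatenate([pos_idx, neg_fixed])
for _ in range(n_iterations):
    train_idx, test_idx = train_test_split(subset, test_size=0.3, stratify=y[subset])
    # ... fit on X[train_idx], y[train_idx]; evaluate on X[test_idx], y[test_idx]

# Option 2: draw a fresh balanced subset in every iteration, then split it
for _ in range(n_iterations):
    neg_i = rng.choice(neg_idx, size=len(pos_idx), replace=False)
    subset_i = np.concatenate([pos_idx, neg_i])
    train_idx, test_idx = train_test_split(subset_i, test_size=0.3, stratify=y[subset_i])
    # ... same fit / evaluate as above
```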

Ben Reiniger
    I'm having trouble understanding your two procedures. Can you make the process a little more explicit? – Eli Jun 15 '23 at 19:49
    It is unclear what a "negative case" is. And I'm sorry, but I don't know what you mean by "over that fixed subset". – Gregg H Jun 15 '23 at 23:42
  • To clarify for others, "negative cases" seems to refer to observations whose target is labeled as 0. The first procedure is a bit under-specified, but I think I understand the spirit (see the last paragraph of my answer). The second procedure just means randomly dropping 1000 negative observations and running CV / train-test-split on the rest of the observations. – chicxulub Jun 16 '23 at 15:52

1 Answer


There's typically no need to balance (read: sacrifice) positive vs. negative observations anywhere in the evaluation procedure. If you want the scoring metric to be balanced, in the sense that it weights positive and negative cases equally regardless of their distribution in the real world, then perform the balancing in the scoring function, not in the data. That way you get a balanced metric that still makes use of all observations.

Here's an example that applies to any scoring metric that decomposes as a sum of errors over individual observations. Partition the set of validation observation indices $D = \{1, 2, \dots, n \}$ into the positively and negatively labeled observations, $D_{+}$ and $D_{-}$ respectively:

$$ \begin{align} \text{plain error} &= \frac{1}{|D|} \sum_{i \in D} \text{err}(\hat{y}_i, y_i) \\ \text{balanced error} &= \frac{1}{2} \Bigg( \frac{1}{|D_{+}|} \sum_{i \in D_{+}} \text{err}(\hat{y}_i, y_i) \Bigg) + \frac{1}{2} \Bigg( \frac{1}{|D_{-}|} \sum_{i \in D_{-}} \text{err}(\hat{y}_i, y_i) \Bigg). \end{align} $$
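As a small illustration of these two quantities, here's a sketch in NumPy using squared error as the per-observation $\text{err}$ (any error that decomposes this way works the same):

```python
import numpy as np

def plain_and_balanced_error(y_true, y_pred, err=lambda p, t: (p - t) ** 2):
    """Plain and class-balanced mean error as defined above."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    errors = err(y_pred, y_true)

    plain = errors.mean()                                # average over all of D
    pos, neg = errors[y_true == 1], errors[y_true == 0]  # D_+ and D_-
    balanced = 0.5 * pos.mean() + 0.5 * neg.mean()       # each class gets weight 1/2
    return plain, balanced

# Imbalanced toy validation set: 2 positives, 6 negatives
y_true = np.array([1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([0.9, 0.4, 0.1, 0.2, 0.0, 0.3, 0.6, 0.1])
print(plain_and_balanced_error(y_true, y_pred))
```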

For an example of re-weighting a slightly more complex metric, see my answer here, which performs the re-weighting for the Area Under the Precision-Recall Curve (AUPRC).
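I won't reproduce that answer here, but the general idea can be sketched with per-observation sample weights (a hypothetical helper; the linked answer may implement the re-weighting differently):

```python
import numpy as np
from sklearn.metrics import average_precision_score

def balanced_average_precision(y_true, y_score):
    """Sketch: AUPRC-style score with each class given equal total weight."""
    y_true = np.asarray(y_true)
    n_pos, n_neg = (y_true == 1).sum(), (y_true == 0).sum()
    # weight each observation by the inverse of its class frequency
    weights = np.where(y_true == 1, 1.0 / n_pos, 1.0 / n_neg)
    return average_precision_score(y_true, y_score, sample_weight=weights)
```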

It's important to keep in mind that balanced scores are not always desirable. For example, if your dataset is a large random sample with class imbalance, and the costs of poor predictions are equivalent across classes, then a balanced scoring metric is the wrong choice: it over-emphasizes the smaller class because it's less prevalent, not because it's more important. The weight of an observation during evaluation should be determined entirely by the costs relevant to the model's application, not by its class's prevalence in the data distribution. Here's a question containing links and answers speaking to this principle.

But say you know that you want a balanced score, and it's difficult to modify the scoring metric to achieve equal weighting across classes. In that case, you do need to balance data, but only on the validation side. Draw the validation split (randomly) from the whole sample as usual. Say that validation split contains $n$ positive observations and $m$ negative observations with $n < m$: randomly subsample $n$ of the $m$ negative observations and score the metric on those $2n$ observations. This procedure seems to be option (2) in your question. The benefit of option (2) over option (1) is not that fewer observations are dropped; in both options, 1,000 observations go unevaluated. The benefit is that option (2) does not disturb the empirical class distribution of the training split, whereas in option (1) the training class distribution is biased with respect to the true 1:3 ratio.
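Concretely, one iteration of that procedure could look like the sketch below (names and the split size are arbitrary; the key point is that only the validation side gets subsampled):

```python
import numpy as np
from sklearn.model_selection import train_test_split

def balanced_validation_indices(y, test_size=0.3, seed=0):
    """Split all indices into train/validation, then subsample negatives in the
    validation part only, so the training split keeps the original class ratio."""
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    train_idx, val_idx = train_test_split(
        np.arange(len(y)), test_size=test_size, stratify=y, random_state=seed
    )
    val_pos = val_idx[y[val_idx] == 1]
    val_neg = val_idx[y[val_idx] == 0]
    # keep all n positives and a random n of the m negatives (n < m assumed)
    val_neg = rng.choice(val_neg, size=len(val_pos), replace=False)
    return train_idx, np.concatenate([val_pos, val_neg])
```

Repeating this across Monte Carlo iterations gives a balanced validation score while each training split still reflects the real 1:3 class ratio.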

chicxulub