There's typically no need to balance (read: sacrifice) positive vs. negative observations anywhere in the evaluation procedure. If you want to ensure that the scoring metric is balanced—in that it weighs positive and negative cases equally, regardless of their distribution in the real world—then perform balancing in the scoring function, not the data. That way, you'll have a balanced metric which makes use of all observations.
Here's an example that applies to scoring metrics which decompose as a sum of errors over individual observations. Partition the set of validation observation indices $D = \{1, 2, \dots, n\}$ into positively labeled observations and negatively labeled ones, $D_{+}$ and $D_{-}$ respectively.
$$
\begin{align}
\text{plain error} &= \frac{1}{|D|} \sum_{i \in D} \text{err}(\hat{y}_i, y_i) \\
\text{balanced error} &= \frac{1}{2} \Bigg( \frac{1}{|D_{+}|} \sum_{i \in D_{+}} \text{err}(\hat{y}_i, y_i) \Bigg) + \frac{1}{2} \Bigg( \frac{1}{|D_{-}|} \sum_{i \in D_{-}} \text{err}(\hat{y}_i, y_i) \Bigg).
\end{align}
$$
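As a concrete illustration, here's a minimal sketch of both quantities for 0/1 labels, using misclassification as the per-observation $\text{err}$; any other per-observation error (e.g. squared error on predicted probabilities) slots in the same way. The function names are mine, not from any library.

```python
import numpy as np

def plain_error(y_true, y_pred):
    """Mean per-observation error over all of D."""
    return np.mean(y_true != y_pred)

def balanced_error(y_true, y_pred):
    """Average of the per-class mean errors, so each class gets weight 1/2
    regardless of how many observations it contains."""
    err = (y_true != y_pred).astype(float)
    pos, neg = (y_true == 1), (y_true == 0)
    return 0.5 * err[pos].mean() + 0.5 * err[neg].mean()

# Example: 900 negatives, 100 positives; the model is worse on positives.
rng = np.random.default_rng(0)
y_true = np.concatenate([np.zeros(900, dtype=int), np.ones(100, dtype=int)])
flip = rng.random(1000) < np.where(y_true == 1, 0.4, 0.05)  # flip 40% of positives, 5% of negatives
y_pred = np.where(flip, 1 - y_true, y_true)

print(plain_error(y_true, y_pred))     # dominated by the large negative class
print(balanced_error(y_true, y_pred))  # weighs both classes equally
```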
For an example of re-weighting a slightly more complex metric, see my answer here. It performs the re-weighting for the Area Under the Precision-Recall Curve (AUPRC).
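I won't reproduce that answer here, but to sketch the same kind of move: scikit-learn's `average_precision_score` accepts per-observation `sample_weight`s, so one way to give each class equal total weight is to weight observations inversely to their class counts. This is a sketch of the general idea, not necessarily the exact re-weighting in the linked answer, and the function name is mine.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def class_balanced_auprc(y_true, y_score):
    """AUPRC computed with per-observation weights chosen so that the
    positive and negative classes contribute equal total weight."""
    y_true = np.asarray(y_true)
    n_pos = (y_true == 1).sum()
    n_neg = (y_true == 0).sum()
    sample_weight = np.where(y_true == 1, 1.0 / n_pos, 1.0 / n_neg)
    return average_precision_score(y_true, y_score, sample_weight=sample_weight)
```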
It's important to keep in mind that balanced scores are not always desirable. For example, if your dataset is a large random sample with class imbalance, and the costs of poor predictions are equivalent across classes, then a balanced scoring metric is wrong; it over-emphasizes the smaller class because it's less prevalent, not because it's more important. The weight an observation gets during evaluation should be determined entirely by the costs relevant to the model's application, not by its class's prevalence in the data distribution. Here's a question containing links and answers speaking to this principle.
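For instance, if the application says a false negative costs five times a false positive, the natural thing is to weight errors by those costs directly. A minimal sketch with made-up cost numbers; the function and the costs are hypothetical:

```python
import numpy as np

# Hypothetical application costs: a missed positive (false negative) costs
# five times as much as a false alarm (false positive).
COST_FN, COST_FP = 5.0, 1.0

def cost_weighted_error(y_true, y_pred):
    """Mean error with each observation weighted by the cost of getting it
    wrong, independent of how prevalent its class is in the data."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    weights = np.where(y_true == 1, COST_FN, COST_FP)
    return np.average((y_true != y_pred).astype(float), weights=weights)
```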
But say you know that you want a balanced score, and it's difficult to modify the scoring metric to achieve equal weighting across classes. In that case, you'll need to balance the data. First, randomly sample the validation split from the whole sample. Say this validation split contains $n$ positive and $m$ negative observations, with $n < m$. Then randomly subsample $n$ of the $m$ negative observations, and compute the metric on these $2n$ observations. This procedure appears to be option (2) in your question.

The benefit of option (2) over option (1) is not that fewer observations are dropped; in both options, 1000 observations go unevaluated. The benefit is that option (2) doesn't disturb the empirical class distribution of the training split. In option (1), the training class distribution would be biased with respect to the 1:3 ratio.
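Here's a minimal sketch of option (2), assuming binary 0/1 labels with positives as the minority class; the function name is mine.

```python
import numpy as np

def balanced_validation_indices(y_val, seed=0):
    """Indices of a class-balanced evaluation subset: all n positives plus
    n negatives sampled at random without replacement (assumes n < m)."""
    rng = np.random.default_rng(seed)
    y_val = np.asarray(y_val)
    pos_idx = np.flatnonzero(y_val == 1)
    neg_idx = np.flatnonzero(y_val == 0)
    keep_neg = rng.choice(neg_idx, size=len(pos_idx), replace=False)
    return np.concatenate([pos_idx, keep_neg])

# Score the metric only on the balanced subset of the validation split:
# idx = balanced_validation_indices(y_val)
# score = metric(y_val[idx], y_pred_val[idx])
```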