
It's my understanding that sample weights are used to ensure that each observation used to train a machine learning model is given a weight corresponding to its perceived importance/value to the model. We would normally pass these weights to the sample_weight argument of an sklearn estimator's fit() method.

However, if we use our model to predict on the unseen data of our test set, our sample weights would seem to be irrelevant, as evidenced by the fact that many estimators in the sklearn library have no sample_weight argument for their predict() methods.

So is there any point in ever passing one's sample weights to the sample_weight argument of a scoring function (e.g. precision_score(), recall_score()) applied to a test set and the predictions it outputs? It seems that doing so would be providing insight gained unfairly in hindsight to score a model and thus give inflated, overly optimistic scores of its performance, though maybe there is some utility to doing this that I'm not seeing?
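For concreteness, here is a minimal sketch of the pattern I'm asking about (the data and weights are made up purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

# Toy data with a made-up per-row importance weight
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = rng.integers(0, 2, size=200)
w = rng.uniform(size=200)

X_train, X_test, y_train, y_test, w_train, w_test = train_test_split(
    X, y, w, random_state=0
)

# The weights clearly belong here...
model = LogisticRegression().fit(X_train, y_train, sample_weight=w_train)

# ...but predict() has no sample_weight argument...
preds = model.predict(X_test)

# ...so does it ever make sense to pass them here?
score = precision_score(y_test, preds, sample_weight=w_test)
```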

pmse234

2 Answers


It is often quite beneficial to pass sample weights to a training function or to a scoring function evaluated on a test data set; there is no point in passing them when the model is simply being used to generate predictions. The former can occur in several situations, for example:

  1. Each training record can represent one or more identical observations. In this case, we'd want the weight to equal the number of observations the record represents. Our test data will presumably be aggregated the same way, and this should definitely be taken into account when calculating the out-of-sample performance statistics. (See the sketch after this list.)
  2. Prior knowledge may tell us that the variability of the target variable given the features differs significantly across training records. Sometimes this is handled automatically, e.g., with a logistic regression, but sometimes not, e.g., with a least squares objective. In this case, we'd like the weights to be proportional to the inverse of the conditional variance of the target variable, or at least to a rough guesstimate thereof. (See "weighted least squares" for a proof of this.) One could perhaps argue the point, but in this case it seems to me the weights should also be applied to the test data set, otherwise our metrics may be heavily skewed by observations which we have little hope of predicting well, and our modeling effort will likely end up being similarly focused on those observations - even though we know, a priori, that we can't expect to predict them well.
  3. The sample may not be representative of the population which is the intended target of the modeling effort. In this case, we'd likely want to increase the weight associated with undersampled subsets of the population. This is done frequently in sample survey analysis, where it may be costly to construct an appropriate stratified sample but post-hoc adjustments are easy to do. Obviously, this reasoning carries over to scoring on a test data set as well.
  4. See Ben Reiniger's answer!
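To make case 1 concrete, here is a minimal sketch with made-up numbers. For a linear model, fitting with integer frequency weights is equivalent to fitting on the data with each row duplicated that many times, and the same counts should then weight the test metric:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Aggregated training data: row i stands for counts[i] identical observations
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([1.1, 2.3, 2.9, 4.2])
counts = np.array([5, 1, 2, 3])

model = LinearRegression().fit(X_train, y_train, sample_weight=counts)

# Frequency weights are equivalent to duplicating the rows outright
model_dup = LinearRegression().fit(
    np.repeat(X_train, counts, axis=0), np.repeat(y_train, counts)
)
assert np.allclose(model.coef_, model_dup.coef_)

# The test set is aggregated the same way, so weight the metric by the
# counts too; otherwise a record standing for 10 observations counts the
# same as one standing for a single observation
X_test = np.array([[1.5], [3.5]])
y_test = np.array([1.6, 3.4])
test_counts = np.array([10, 1])

mse = mean_squared_error(y_test, model.predict(X_test), sample_weight=test_counts)
```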

When scoring takes place for predictive purposes, though, there is no point in weighting any more. The training has already happened. Each record to be scored is scored individually; the results are not combined (across records) into a performance statistic. If I gave the record a "weight" of, say, $2.5$, what would the scoring algorithm do with it? It is just calculating the prediction based on the feature values in the record; the prediction will be the same regardless of whether the record represents, for example, $3$ observations or $1$, or whether the error associated with the prediction is likely to be large or small.
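One way to see this in code (a minimal sketch with synthetic data): a fitted model's prediction for a record depends only on that record's features, so it is identical whether the record is scored alone or in a batch, and there is nowhere for a weight to enter.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)

model = LinearRegression().fit(X, y)

# predict() takes no sample_weight: each record is scored on its own
alone = model.predict(X[[0]])[0]   # score one record by itself
in_batch = model.predict(X)[0]     # the same record within a batch
assert np.isclose(alone, in_batch)
```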

jbowman

I think the point is to allow (at least an approximation of) cost-sensitive metrics: setting each row's weight according to its impact on business value.
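As a rough sketch (the impact figures below are invented): weighting each test row by, say, the dollar value at stake turns a plain recall into an approximation of the fraction of at-risk value the model catches.

```python
import numpy as np
from sklearn.metrics import recall_score

y_true = np.array([1, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1])

# Hypothetical business impact of each row, e.g. dollar value at stake
impact = np.array([100.0, 5000.0, 50.0, 200.0, 50.0])

plain = recall_score(y_true, y_pred)  # 2 of 3 positives caught: 0.67
# Weighted: the missed $5000 positive dominates -> (100+200)/5300 ~= 0.057
weighted = recall_score(y_true, y_pred, sample_weight=impact)
```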

> It seems that doing so would be providing insight gained unfairly in hindsight to score a model and thus give inflated, overly optimistic scores of its performance...

If you're using weights in the scorers to balance your dataset, I agree. But this is why the scorers have sample_weight rather than class_weight. (Of course, that won't stop some people from trying to use sample_weight just to balance classes, e.g. a comment on https://stackoverflow.com/a/54094871/10495893.)

https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics
https://github.com/scikit-learn/scikit-learn/issues/15651

Ben Reiniger