
I am using machine learning to approach a balanced binary classification task.

Some of the rows are more important/valuable than others, so getting them right is extra important. To accommodate this, I decided to do two things:

  1. Sample weighting (to emphasize the more important points during training)
  2. A custom evaluation/scoring metric (to reflect whether the model got the important rows correct)

The problem is this: when I use my custom evaluation/scoring metric for model selection during cross-validation, there appears to be overfitting to the validation set every time. That is, the performance during cross-validation (for model selection/hyperparameter tuning) is considerably better than on the test set.

Context: I am doing nested cross-validation. So for each of the 5 outer folds, I am able to compare the inner cross-validation performance (vs. baseline) against the outer-fold test-set performance (vs. baseline).

When I score the models (in model selection and on the test set) using my metric, there seems to be a consistent and considerable worsening of performance from cross-validation to the test set. This does NOT happen when I use a regular evaluation metric, such as accuracy, for model selection and test-set evaluation. Why might this be?

My evaluation metric is defined as follows: every data point is assigned an importance score strictly between 0 and 1 (it's never 0 and never 1). These importance scores are used as sample weights AND they are the same scores counted by the evaluation metric. The metric is simply the sum of the importance scores of every row predicted correctly, minus the sum of the importance scores of every row predicted incorrectly, divided by the total number of rows. So it's sort of like "average importance correctly predicted per row".
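
To make the definition concrete, this is how I would write it (a minimal sketch; the notation and function name are just for illustration):

$$\text{score} = \frac{1}{n}\left(\sum_{i:\,\hat{y}_i = y_i} w_i \;-\; \sum_{i:\,\hat{y}_i \neq y_i} w_i\right),$$

where $w_i \in (0,1)$ is the importance score of row $i$ and $n$ is the total number of rows.

```r
# Sum of importance of correctly predicted rows minus importance of
# incorrectly predicted rows, divided by the total number of rows.
importance_metric <- function(y_true, y_pred, importance) {
  correct <- (y_true == y_pred)
  (sum(importance[correct]) - sum(importance[!correct])) / length(y_true)
}
```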

I would greatly appreciate any insight here, as I have never used a custom evaluation metric before, and I'm happy to give a bounty to an answer.

  • Have you considered any regular metric other than accuracy? Accuracy is a rather poor metric. Are the sample weights balanced as well (on average, are they the same for both groups)? What do you do with the importance weights for the test set? I assume that at prediction time you won't have them, so I guess you are not using them on the test set either? – Tim Mar 10 '22 at 22:38
  • Can you please clarify how the importance weights are created? I suspect that the evaluation is inadvertently biased if the same data are used to predict "in-sample" (for training as well as validation evaluations) but then to predict "out-of-sample" (for testing evaluation). – usεr11852 Mar 11 '22 at 14:03
  • @Tim I have not considered any other "baseline" metric - I just wanted some point of comparison. The sample weights are pretty close to balanced for the two groups. You're right, I don't use the importance weights in any way (besides scoring), since I don't have them ahead of time when predicting in real time. – Vladimir Belik Mar 11 '22 at 15:43
  • @usεr11852 Absolutely! The data is a financial time series, and the importance weights are based on the magnitude of the price movement (bigger moves get more weight). I'm not sure I understand your suspicion/concern, and why (in such case) this problem wouldn't appear for accuracy metrics. – Vladimir Belik Mar 11 '22 at 15:45

1 Answer


It sounds like what you want to optimize are the "importance scores". With your metric, you are verifying whether your model is able to correctly classify the samples as important vs. not. In that case, why not make the "importance" your target variable? You could use -1 * importance as the target for the negative class and +1 * importance for the positive class, and treat this as a regression problem (or classification, if your algorithm allows for a fuzzy target); for making hard classifications you would then apply some threshold to the predicted scores. This way, you would be directly optimizing the scores. There wouldn't be a problem with designing custom metrics, because you could just use a standard loss such as the logistic loss (for $\pm 1$ labels), squared error, or absolute error (though unlike the previous two, absolute error is not a proper scoring rule, so you would probably want to stick to the proper ones).
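
To illustrate, here is a rough R sketch of the signed-importance target described above (the synthetic data, the choice of `lm`, and all variable names are purely illustrative, not your actual setup):

```r
set.seed(42)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$label      <- rbinom(n, 1, plogis(dat$x1))   # binary class in {0, 1}
dat$importance <- runif(n, 0.05, 0.95)           # importance scores in (0, 1)

# Encode the target as signed importance: +importance for the positive class,
# -importance for the negative class.
dat$signed_target <- ifelse(dat$label == 1, dat$importance, -dat$importance)

fit        <- lm(signed_target ~ x1 + x2, data = dat)  # any regression learner would do
pred_score <- predict(fit, newdata = dat)

# Hard classification: threshold the predicted signed score (here at 0).
hard_class <- as.integer(pred_score > 0)
```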

On the other hand, if you really only care about using the importance weights when training and validating the results, why not just use standard weighted metrics? Weights can be introduced into standard metrics (e.g. squared error) by replacing the plain average of the individual errors with a weighted average.
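
For example, weighted versions of standard losses are straightforward to write down (a sketch; the function names are mine):

```r
# Weighted squared error: the plain mean of squared errors is replaced
# by a weighted mean, with weights w (e.g. the importance scores).
weighted_mse <- function(y, pred, w) {
  sum(w * (y - pred)^2) / sum(w)
}

# Weighted log-loss for probabilistic predictions p of a binary y in {0, 1}.
weighted_log_loss <- function(y, p, w, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps)  # clip probabilities away from 0 and 1
  -sum(w * (y * log(p) + (1 - y) * log(1 - p))) / sum(w)
}
```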

Finally, as for the metrics themselves: accuracy is a poor metric. Accuracy doesn't care whether the predicted scores are well calibrated, so it would not be able to measure the kind of changes that you intend to measure with your metric. In standard classification it can happen that different metrics disagree because they measure different things, which is why the general advice is to stick to a single metric that you optimize.

Tim
  • Thank you so much for taking the time. In a sense, you're right - if I could, I would love to predict the importance scores directly. However, this is a financial dataset and given that the importance metrics are based on the magnitude of the corresponding price movement, I think doing regression here is very unlikely to yield a good result - too much noise, too hard of a problem. Let me quickly explain my logic of using these scores - they are very hard to predict directly, so I'm essentially just trying to predict the sign of the move (roughly). – Vladimir Belik Mar 11 '22 at 15:47
  • Since I can't predict whether the next period's importance score is high or low, I am doing all of this to make it so that IF/GIVEN next period's importance score is high, I am very likely to get the sign right (since I oversampled). IF/GIVEN next period's importance score is very low, I don't really care much if I get the direction right or wrong. I hope that makes sense. – Vladimir Belik Mar 11 '22 at 15:50
  • In your second paragraph, you bring up key points that I've been trying to investigate, but I didn't have the vocabulary for. At the end, I understand that accuracy vs. my custom metric can be measuring different things. However, would that explain why one metric shows validation overfitting, while the other doesn't? I would have thought given the relative nature of overfitting, it wouldn't show up only with certain metrics. – Vladimir Belik Mar 11 '22 at 15:52
  • Given my previous comment, I'm still confused because I use custom metric for hyperparam tuning/model selection AND I use it to measure out-of-sample performance, so I do feel like I'm sticking to one single metric that I'm optimizing. Would you disagree? – Vladimir Belik Mar 11 '22 at 15:55
  • @VladimirBelik I don't follow why you can't predict importance * label if you care about predicting the importance. You are optimizing your algorithm to predict one thing, but validating it using a completely different criterion. You are making things harder for yourself, while the solution is pretty simple, as stated above. By using importance * label as the target, you are optimizing for the same criterion that your custom metric measures. – Tim Mar 11 '22 at 16:09
  • I can't predict "importance x label" because the "importance" part of that is extremely hard to predict. For my practical purposes, importance is unpredictable, unlike the label. The label is predictable, BUT I want to be sure that for those rows where importance is high, I predict the label correctly. Therefore, I oversample and score according to importance, to get models that do well in the case where, GIVEN that the importance is high, the label is predicted accurately. Does that help explain? Since I can't predict importance, I want it to be the case that, given high importance, the label is predicted correctly. – Vladimir Belik Mar 11 '22 at 16:20
  • @VladimirBelik then your metric is wrong. Judging the effectiveness of an algorithm using a criterion that is impossible to reach doesn't sound like a reasonable metric. Moreover, you cannot have the "given" model, i.e. condition on the importance, because if I understand correctly you would not know the importance at prediction time, so your model would be missing important information. I might be lacking details, but it sounds like your intended solution is not what you want it to be. – Tim Mar 11 '22 at 16:28
  • I see your point, but let's take an analogy. Let's say you're predicting 1/0 whether a person has a disease. You have a dataset of young and old people, and you want your model to be extra good at predicting whether the older people have this disease or not (since it's more important to get treatment faster, if they do have it). How would you go about addressing this? You're not predicting the age of the person, and age has nothing to do with the disease. But you want your prediction to be most accurate for older people (biggest consequences). How would you approach a situation like this? – Vladimir Belik Mar 11 '22 at 16:43
  • Maybe this is where the analogy breaks down, but for your features, you have all kinds of data about the person, but when a new patient comes in (test set), you get their info except their age. You don't know their age ahead of time. The person could be young, they could be old, but you want it to be so that IF they are old, your prediction is more likely to be correct. Is there any way to approach this? Or am I saying crazy things, accidentally trying to "cheat the system" somehow? – Vladimir Belik Mar 11 '22 at 16:48
  • @VladimirBelik OK, so you don't want to accurately predict the importance, but rather use it as a weight when training; but then, why won't you just use a standard loss weighted by those weights (i.e., instead of averaging it, you would use a weighted average)? – Tim Mar 11 '22 at 17:07
  • Could you please explain a bit more? By "use just a standard loss but weighted", do you mean, why don't I change my loss function to take into account the weights? – Vladimir Belik Mar 11 '22 at 17:20
  • @VladimirBelik if you care about weights, this would be a pretty standard approach. Just use weighted MSE, or log-loss, or whatever loss makes sense for the problem. – Tim Mar 11 '22 at 17:25
  • I have heard of this! Maybe I misunderstand, but the reason I haven't done it is mostly a software limitation. I'm working in R, and I don't think all the packages for every algorithm I'm using have an option to incorporate weights into the training loss function. However, I read that doing this oversampling (like I am) should achieve the same effect, so I chose to take that route. For example, RF and logistic regression in R don't, as far as I can see, have any way to access the loss function used in training. But again, doesn't my importance-based oversampling achieve the same end? – Vladimir Belik Mar 11 '22 at 17:29
  • @VladimirBelik not sure about ready implementations in R, but they are trivial to implement, and R has a lot of packages, so I'd expect you should be able to find one. Yes, oversampling and weighting are technically the same (a tiny numeric check is sketched after these comments). – Tim Mar 12 '22 at 21:39
  • Hi Tim, I understand. I'm sorry to take more of your time, but all this being said, I think my question still stands. I think we've established that my oversampling/weighting, selection procedure and scoring procedure are all aligned - yet I see consistent overfitting to the cross validation set. Is there any other systematic reason this could occur? Or just possibly "bad luck"? I'm using KNN and logistic regression - perhaps the methods lend themselves to this kind of result? – Vladimir Belik Mar 14 '22 at 19:24
  • @VladimirBelik I can't give you the exact answer without access to the data. But I'll repeat myself: you're optimizing a completely different criterion than the one you use as an evaluation metric, so it's not surprising that the results aren't remarkable. – Tim Mar 14 '22 at 19:49
  • So either change your loss function (as suggested above) or the metric. – Tim Mar 14 '22 at 19:51
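
A tiny numeric check of the weighting-vs-oversampling equivalence mentioned in the comments (illustrative numbers only): with integer weights, the weighted average of per-row losses equals the plain average over a dataset in which each row is replicated in proportion to its weight.

```r
loss <- c(0.2, 0.9, 0.5)              # per-row losses (made up)
w    <- c(1, 3, 2)                    # integer weights so the replication is exact

weighted_avg <- sum(w * loss) / sum(w)
replicated   <- rep(loss, times = w)  # each row duplicated w times ("oversampling")
all.equal(weighted_avg, mean(replicated))  # TRUE
```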