Regression metrics when underestimation is worse than overestimation

Question

I am trying to predict the time it will take to complete a task, given some data. What matters most to me is that the model should overestimate that time rather than underestimate it, even if underestimating would give a smaller overall error.

Which loss function and metrics should I use in such a situation?

You can write down a loss function and then write code to minimise it. In principle that is the entire solution. Or you might find that working with root or log of time (the latter only if all times are positive) gives you an adequate approximation. — Nick Cox, Jul 13 '20 at 09:52
@NickCox What would you say to quantile regression at, say, quantile $0.75?$ This would make the model prefer to miss high than to miss low. — Dave, Jul 13 '20 at 09:56
How can I use it with library models such as sklearn's random forest regressor? — Slajni, Jul 13 '20 at 10:00
@Dave That's changing the question, but the answer might be helpful. — Nick Cox, Jul 13 '20 at 10:59
How to do any of this with your preferred software is a different question and in any event I couldn't offer advice on software I've never used. — Nick Cox, Jul 13 '20 at 11:00

score 2 · Answer 1 · edited Feb 20 '23 at 16:45

You might be interested in quantile regression. When you run a quantile regression, you get to decide how heavily high misses and low misses are penalized, and these penalties do not have to be equal. You could fit a high quantile (perhaps quantile $0.75$) so that the model tends to aim high.

Quantile regression optimizes the following loss function $L_{\tau}$, where $\tau$ is the quantile you want to estimate.

$$ l_{\tau}(y_i, \hat y_i) = \begin{cases} \tau\vert y_i - \hat y_i\vert, & y_i - \hat y_i \ge 0 \\ (1 - \tau)\vert y_i - \hat y_i\vert, & y_i - \hat y_i < 0 \end{cases} $$

$$ L_{\tau}(y, \hat y) = \sum_{i=1}^n l_{\tau}(y_i, \hat y_i) $$

When $\tau=0.5$, low and high misses are penalized equally. If $\tau>0.5$, missing low incurs a more severe penalty than missing high, incentivizing your model to miss high rather than miss low.
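
To make the asymmetry concrete, here is a minimal sketch of the loss above in plain Python. The function name `pinball_loss`, the choice $\tau = 0.75$, and the simulated task durations are illustrative assumptions, not anything from the question's data.

```python
import random


def pinball_loss(y_true, y_pred, tau):
    """Sum of the per-observation quantile (pinball) losses, with 0 < tau < 1."""
    total = 0.0
    for y, y_hat in zip(y_true, y_pred):
        diff = y - y_hat
        # diff >= 0: the prediction was too low (underestimate), weight tau.
        # diff < 0: the prediction was too high (overestimate), weight 1 - tau.
        weight = tau if diff >= 0 else 1 - tau
        total += weight * abs(diff)
    return total


tau = 0.75

# Same absolute error of 2, in opposite directions.
under = pinball_loss([10.0], [8.0], tau)   # predicted 8, truth 10
over = pinball_loss([10.0], [12.0], tau)   # predicted 12, truth 10
print(under, over)  # 1.5 0.5

# The constant prediction that minimises the loss sits near the tau-quantile.
random.seed(0)
times = [random.expovariate(1 / 30) for _ in range(500)]  # hypothetical durations
best = min(sorted(times), key=lambda c: pinball_loss(times, [c] * len(times), tau))
frac_not_underestimated = sum(t <= best for t in times) / len(times)
print(round(frac_not_underestimated, 3))
```

With $\tau = 0.75$, an underestimate costs three times as much as an overestimate of the same size, and the best constant guess covers roughly 75% of the observed durations.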

As far as Python goes, quantile random forests appear to be implemented in scikit-garden. More common (even if not what works for you) would be a linear quantile regression, which is implemented in sklearn and in statsmodels.

edited Feb 20 '23 at 16:45

John Madden

answered Feb 20 '23 at 16:01

Dave

I'm not sure I say anything different in this answer than in this post, however. – Dave Feb 20 '23 at 16:01
Do you mind if I edit a picture of the hinge loss into this answer? – John Madden Feb 20 '23 at 16:16
@JohnMadden Hinge or pinball? – Dave Feb 20 '23 at 16:19
oh yes pinball indeed (how did I get such descriptive names mixed up ;)) – John Madden Feb 20 '23 at 16:20
@JohnMadden Sure, edit in a picture of pinball loss. // A potentially related discussion about quantile regression. – Dave Feb 20 '23 at 16:21
I added an ugly image in there, can't figure out how to center it on this website tho... – John Madden Feb 20 '23 at 16:45
similar answer without piecewise linear objective function would be expectiles or M-quantiles. – Josef Feb 20 '23 at 17:47