4

I am trying to predict the time it will take to complete a task, given some data. What matters most to me is that I would rather the model overestimate that time than underestimate it, even if underestimating would give a smaller overall error.

Which loss function and metrics should I use in such a situation?

Slajni
  • 45
  • 1
    You can write down a loss function and then write code to minimise it. In principle that is the entire solution. Or you might find that working with root or log of time (the latter only if all times are positive) gives you an adequate approximation. – Nick Cox Jul 13 '20 at 09:52
  • @NickCox What would you say to quantile regression at, say, quantile $0.25?$ This would make the model prefer to miss low than to miss high. – Dave Jul 13 '20 at 09:56
  • How can I use it with library models such as sklearn's random forest regressor? – Slajni Jul 13 '20 at 10:00
  • @Dave That's changing the question, but the answer might be helpful. – Nick Cox Jul 13 '20 at 10:59
  • How to do any of this with your preferred software is a different question and in any event I couldn't offer advice on software I've never used. – Nick Cox Jul 13 '20 at 11:00

1 Answer

2

You might be interested in quantile regression. When you run a quantile regression, you decide how heavily high misses and low misses are penalized, and the two penalties do not have to be equal. You could fit a high quantile (perhaps quantile $0.75$) so that the model tends to aim high.

Quantile regression optimizes the following loss function $L_{\tau}$, where $\tau$ is the quantile you want to estimate.

$$ l_{\tau}(y_i, \hat y_i) = \begin{cases} \tau\vert y_i - \hat y_i\vert, & y_i - \hat y_i \ge 0 \\ (1 - \tau)\vert y_i - \hat y_i\vert, & y_i - \hat y_i < 0 \end{cases}\\ L_{\tau}(y, \hat y) = \sum_{i=1}^n l_{\tau}(y_i, \hat y_i) $$


When $\tau=0.5$, low and high misses are penalized equally. If $\tau>0.5$, missing low incurs a more severe penalty than missing high, incentivizing your model to miss high rather than miss low.
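To make the asymmetry concrete, here is a minimal pure-Python sketch of the loss $L_{\tau}$ as written above. The function names `pinball_loss` and `underestimate_rate` and the toy numbers are illustrative assumptions, not part of any library. The second function is one possible companion metric (the share of predictions that fall below the true value), offered as a suggestion rather than a standard.

```python
def pinball_loss(y_true, y_pred, tau):
    """Summed pinball (quantile) loss L_tau for a quantile level tau in (0, 1)."""
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    total = 0.0
    for y, y_hat in zip(y_true, y_pred):
        residual = y - y_hat
        if residual >= 0:
            # Prediction was too low (or exact): weight tau.
            total += tau * residual
        else:
            # Prediction was too high: weight 1 - tau.
            total += (1.0 - tau) * (-residual)
    return total


def underestimate_rate(y_true, y_pred):
    """Fraction of cases where the prediction is below the true value."""
    misses_low = sum(1 for y, y_hat in zip(y_true, y_pred) if y_hat < y)
    return misses_low / len(y_true)


# Toy data (made up): two tasks that each really took 10 time units.
y_true = [10.0, 10.0]
too_low = [8.0, 8.0]    # underestimates by 2
too_high = [12.0, 12.0]  # overestimates by 2

loss_low = pinball_loss(y_true, too_low, tau=0.75)
loss_high = pinball_loss(y_true, too_high, tau=0.75)
print(loss_low, loss_high)
print(underestimate_rate(y_true, too_low), underestimate_rate(y_true, too_high))
```

With $\tau = 0.75$, an underestimate of a given size costs three times as much as an overestimate of the same size, because the weights are $\tau = 0.75$ and $1 - \tau = 0.25$.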

As far as Python goes, quantile random forests appear to be implemented in scikit-garden. A more common option (even if it is not what works for you) is linear quantile regression, which is implemented in sklearn and in statsmodels.

John Madden
  • 4,165
  • 2
  • 20
  • 34
Dave
  • 62,186