
Consider the following loss function: $$\mathcal L (y, \hat y) = |y| \left[\log (1 + |y - \hat y|^2) \mathbf 1 _{\{y\hat y \geq 0\}} + |y - \hat y|^2\mathbf 1 _{\{y\hat y < 0\}} \right ]$$

The idea is to penalize predictions whose sign differs from the target's when the target is large, while keeping the overall loss small for small targets. Are there any problems with such a loss function?

vladkkkkk

1 Answer


When I think about a loss function for a point prediction, my mental model always runs like this (Kolassa, 2020): we don't know the outcome we want to predict, so it's best to think about it in terms of a predictive probability density. A point prediction is a one-number summary of this predictive density. Now: given some predictive density, which one-number summary will lead to the lowest expected loss? Essentially, any loss function elicits a particular functional from the predictive density: the MSE elicits the mean, the MAE elicits the median, a pinball loss elicits a quantile. Which functional does your loss function elicit?

(And yes, I maintain this is still a useful way of thinking even if you do not consider a predictive density explicitly - because your uncertainty is always there in the background, whether you choose to ignore it or not.)
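To see this elicitation in action, here is a minimal simulation sketch (my illustration, not part of the original answer; the quantile level 0.75 is an arbitrary choice): under a normal predictive density, the MSE is minimized at the mean, the MAE at the median, and the pinball loss at the corresponding quantile.

```r
# Which point prediction minimizes the expected loss under a few standard losses?
set.seed(1)
sims <- rnorm(1e6, mean = 1, sd = 2)   # draws from the predictive density N(1, 2)
xx <- seq(-3, 5, by = 0.01)            # candidate point predictions

mse <- sapply(xx, function(yy) mean((yy - sims)^2))
mae <- sapply(xx, function(yy) mean(abs(yy - sims)))

# pinball (quantile) loss at level tau
tau <- 0.75
pin <- sapply(xx, function(yy) mean(pmax(tau * (sims - yy), (tau - 1) * (sims - yy))))

xx[which.min(mse)]   # close to the mean, 1
xx[which.min(mae)]   # close to the median, 1 (for a normal, mean = median)
xx[which.min(pin)]   # close to qnorm(0.75, 1, 2), about 2.35
```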

In the present case, assume for example your predictive density is normal, with a mean of 1 and a standard deviation of 2. (Or alternatively, assume your uncertainty about the outcome can be described in this way.) Then it turns out that the optimal point prediction under your loss is zero, i.e., zero is the $\hat{y}$ that minimizes your expected loss. Thus, your loss does not elicit an unbiased expectation prediction.

[Plot: expected loss as a function of the point prediction $\hat y$, with its minimum at $\hat y = 0$; a vertical red line marks the mean at 1.]

By symmetry, zero is also the optimum point prediction if your conditional expectation is -1. As the SD goes down, the optimal point prediction will eventually move away from zero.

This may or may not be what you expected. After all, the pinball loss is explicitly built to elicit a quantile prediction. However, given that it seems to be little known and appreciated that loss functions like the MAE or the MAPE also elicit non-expectation functionals, I think that this aspect is relevant.

R code:

mm <- 1   # mean of the predictive density
sd <- 2   # standard deviation
xx <- mm + seq(-2*sd, 2*sd, by = 0.01)   # candidate point predictions

sims <- rnorm(1e6, mm, sd)               # draws from the predictive density
# Expected loss for each candidate prediction yy. Note that the |y| weight uses
# the simulated outcomes (sims), per the loss definition.
loss <- sapply(xx, function(yy)
  mean(abs(sims) * (log(1 + (yy - sims)^2) * (sims*yy >= 0) +
                    (yy - sims)^2 * (sims*yy < 0))))
xx[which.min(loss)]                      # the optimal point prediction: 0

plot(xx, loss, type = "l")
abline(v = mm, col = "red")              # the mean, for comparison

Stephan Kolassa
  • Essentially, any loss function elicits a particular functional from the predictive density: the MSE elicits the mean, the MAE elicits the median, a pinball loss elicits a quantile. I do find it useful to think this way. However, how would you fit penalized estimation like ridge, LASSO, or elastic net into this framework? // I can post this as its own question, if you think it deserves a full answer. – Dave Jul 11 '23 at 13:21
  • @Dave: I think regularization is really orthogonal to the loss function. You can use the MSE to elicit a conditional expectation, or the MAE for the median, and use any kind of regularization to reduce overfitting. Does that make sense? – Stephan Kolassa Jul 11 '23 at 13:23
  • That's how I want to think about it, but I still struggle with the fact that the MAE can be minimized when conditional means are desired, and this can quell some overfitting to "outlier"-type points, so I am not sure why the MAE would be so different from ridge in this sense. Then why would the proposed loss function be any different? (I think this means I've found my Tuesday question to post.) – Dave Jul 11 '23 at 13:26
  • @Dave: Post away, and please post the link here! If your conditional distribution is nice and symmetric, then of course the MAE and the MSE have the same minimizer, since the mean and median coincide, and then the discussion is not so much about the minimizer any more. However, when you do intermittent demand forecasting, the MAE can make flat zero forecasts look spuriously attractive. (And I still think regularization-or-not is orthogonal...) – Stephan Kolassa Jul 11 '23 at 13:28
  • Regularization just biases the predictions; it's still a conditional statistic – Firebug Jul 11 '23 at 14:50
  • Using your framework, I think as follows: I want to estimate the mean, given that the distribution of the output depends on the sign of the target. If the signs don't match, I assume the distribution is Gaussian, while if they match, the distribution is t (that is covered by the $\log (1 + x^2)$ term). But I also almost don't care about predictions if the target value is small, which is why I multiplied by $|y|$. The resulting loss looks too complicated to me, so I'm asking whether I'm on the right track to fulfilling my needs :) – vladkkkkk Jul 11 '23 at 15:03
  • One aspect is that whereas most loss functions elicit a functional that is independent of the predictive distribution itself (MSE elicits the mean, MAE the median, pinball a quantile), what you want to elicit depends on the parameters of the predictive distribution. So your loss function quite likely has to be more complicated. – Stephan Kolassa Jul 12 '23 at 07:11
  • You could also look at weighted sums of the MSE (to elicit the mean) and $\text{sgn}(y\hat{y})$ (to elicit the right sign), with the weighting depending on $y$ via some function. I would recommend playing around with some "reasonable" predictive distributions and looking at what point prediction minimizes any candidate loss, along the lines of my simulations here and elsewhere. (I do such simulations almost always when I look at a newly proposed loss function.) – Stephan Kolassa Jul 12 '23 at 07:12
  • I think it's exhaustive answer, will work in that direction. Thanks a lot for fruitful discussion! – vladkkkkk Jul 12 '23 at 08:53
  • @StephanKolassa we'll probably want some kind of relaxation of $\mathrm{sgn}(y\hat y)$, as that's a discontinuous function. After all, we've got to optimize this thing somehow once we're happy with its statistical properties! – John Madden Jul 12 '23 at 13:14
  • @JohnMadden: yes, of course. Then again, "penalize different signs between the prediction and the outcome" is inherently a discontinuous objective, which makes our life harder precisely near zero... – Stephan Kolassa Jul 12 '23 at 13:53
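To make the weighted-sum idea from the comments concrete, here is a minimal sketch (my own illustration, not from the thread) that combines the MSE with a smoothed sign-agreement penalty, using $\tanh(k\, y\hat y)$ as a continuous stand-in for $\mathrm{sgn}(y\hat y)$; the weight $|y|$ and the constants `lambda` and `k` are arbitrary illustration choices.

```r
# Sketch: MSE plus a smoothed, |y|-weighted sign-mismatch penalty.
# tanh(k * y * yhat) is a continuous relaxation of sgn(y * yhat);
# lambda and k are arbitrary constants for illustration.
smooth_loss <- function(yhat, y, lambda = 1, k = 10) {
  (y - yhat)^2 + lambda * abs(y) * (1 - tanh(k * y * yhat))
}

# Same kind of simulation as in the answer: predictive density N(1, 2).
set.seed(1)
sims <- rnorm(1e6, 1, 2)
xx <- seq(-3, 5, by = 0.01)
eloss <- sapply(xx, function(yy) mean(smooth_loss(yy, sims)))
xx[which.min(eloss)]  # check where this loss puts the optimal point prediction
```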