
I want to create a regression model with the following properties:

  • prediction should be close to target
  • target and prediction should have the same sign
  • small penalty if either the target or the prediction is close to 0
  • extra penalty if both are far from zero and have opposite signs

Consider the following loss function: $$\mathcal L _\alpha (y, \hat y) = (y - \hat y)^2 - \alpha y\hat y, $$ where $\alpha \geq 0$ is a parameter.
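
For concreteness, a small R sketch of this loss may help (the function name prop_loss below is only illustrative, not from any package); it shows the behaviour described above: same-sign pairs far from zero are rewarded, opposite-sign pairs far from zero get an extra penalty, and pairs close to zero get only a small penalty. Note that the loss can be negative for same-sign pairs, which is relevant to the first question below.

# Proposed loss: squared error minus alpha * y * yhat
prop_loss <- function(y, yhat, alpha = 1) (y - yhat)^2 - alpha * y * yhat

prop_loss(2, 2)       # same sign, far from zero:       0 - 4  = -4
prop_loss(2, -2)      # opposite signs, far from zero: 16 + 4  = 20
prop_loss(0.1, -0.1)  # both close to zero:            0.04 + 0.01 = 0.05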

Questions:

  • Is it a valid loss function?
  • Does it fulfill the requirements?
  • Can it be improved?
  • Are there any issues with such loss?
  • Has anyone tried something similar?
vladkkkkk
  • The standard quadratic loss satisfies all your criteria. – whuber Jul 10 '23 at 22:50
  • @whuber I think OP's (4) means that, for $x_1,\hat x_1$ and $x_2,\hat x_2$ such that $|x_1-\hat x_1| = |x_2-\hat x_2|$ but, say, $x_1,\hat x_1 > 0$ while $x_2 < 0, \hat x_2 > 0$, we should have $\mathcal{L}(x_1,\hat x_1) < \mathcal{L}(x_2,\hat x_2)$, which is not satisfied by quadratic loss (or any other translation-invariant loss). – John Madden Jul 10 '23 at 23:41
  • dear OP, you could try something like $(y_i-\hat y_i)^2 + \alpha \max(0, -\mathrm{sgn}(y_i) \hat y_i) $, which is convex. – John Madden Jul 10 '23 at 23:54
  • What is your definition of a loss function for your first bullet point? – Stephan Kolassa Jul 11 '23 at 06:36
  • @JohnMadden thanks a lot for suggestion! Can it be $(y_i - \hat y_i)^2 + \alpha \max (0, -y_i \hat y_i)$ to take into account that for small $y_i$ extra penalty should be small regardless of prediction? – vladkkkkk Jul 11 '23 at 07:16
  • (3) is strange because the algorithm can't control the target. And it conflicts with (1): if the target is, say, 0.001, then what do you want the optimal prediction to be? – usul Jul 11 '23 at 09:35
  • Maybe you can add a logloss between 1(y>0) and 1(y_hat>0). – Lucas Morin Jul 11 '23 at 11:16
  • @lcrmorin How would that work for values that aren’t bounded to $(0, 1)?$ – Dave Jul 11 '23 at 11:17
  • Consider a mixture model with a sign (i.e. asymmetric Rademacher/ shifted Bernoulli) and magnitude component. It's often the case that it is possible to maximize the likelihood of such a parametric model. You can use its negative loglikelihood as a loss function. – Firebug Jul 11 '23 at 12:11
  • Please post your question about the new loss function as its own question. It deserves a set of answers distinct from those about your first function. – Dave Jul 11 '23 at 12:20
  • @Firebug can you please post some links to read about asymmetric Rademacher/ shifted Bernoulli please? – vladkkkkk Jul 11 '23 at 12:23
  • @John My point is that the criteria are too vague and permit too many different responses. The question needs clarification. – whuber Jul 11 '23 at 13:44
  • I was working on this problem in the context of optical flow estimation; I think that the following paper and supplementary material could be of good help! https://openaccess.thecvf.com/content/WACV2023/html/Savian_Towards_Equivariant_Optical_Flow_Estimation_With_Deep_Learning_WACV_2023_paper.html Regards, Stefano – Jeff Baena Jul 12 '23 at 08:21
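
For reference, here is a minimal R sketch of the two convex sign-penalty variants suggested in the comments above (the function names loss_sgn and loss_prod are only illustrative, not from any package):

# John Madden's suggestion: hinge penalty whenever the prediction has the wrong sign
loss_sgn <- function(y, yhat, alpha = 1) (y - yhat)^2 + alpha * pmax(0, -sign(y) * yhat)

# vladkkkkk's variant: the penalty also scales with |y|, so small targets matter less
loss_prod <- function(y, yhat, alpha = 1) (y - yhat)^2 + alpha * pmax(0, -y * yhat)

loss_sgn(0.01, -2)   # about 4.04 + 2.00: full sign penalty even for a tiny target
loss_prod(0.01, -2)  # about 4.04 + 0.02: sign penalty shrinks with |y|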

1 Answer


I have to admit that I disagree with Dave's answer (but I did upvote it, because it is useful). In my opinion, it makes little sense to consider your loss as a function of two variables when we optimize. After all, we usually cannot influence the true outcome $y$ or its distribution, only our point prediction $\hat{y}$.

Thus, I submit that it is more helpful to consider your loss as a random variable (through the uncertainty in the outcome $y$) that depends on the variable $\hat{y}$, and to ask which value of $\hat{y}$ minimizes, e.g., the expectation of that loss.

I like to investigate things like this through simple simulations. For example, in analogy to your other thread (Loss function that penalizes wrong sign predictions), assume your predictive uncertainty about the outcome can be parameterized as a normal distribution with mean 1 and standard deviation 2. It turns out that the $\hat{y}$ that minimizes the expected loss for $\alpha=1$ is 1.5, i.e., your loss incentivizes us to give a prediction that is higher than the mean. This may well be what you want - but it is important to be aware of this effect.
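
As a cross-check, the optimum is also available in closed form here: writing $Y \sim \mathcal N(\mu, \sigma^2)$ for the predictive distribution, we have $$\operatorname E\big[\mathcal L_\alpha(Y, \hat y)\big] = \sigma^2 + (\mu - \hat y)^2 - \alpha \mu \hat y,$$ which is minimized at $\hat y^\ast = \mu\left(1 + \tfrac{\alpha}{2}\right)$. With $\mu = 1$ and $\alpha = 1$ this gives $\hat y^\ast = 1.5$, matching the simulation below.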

[Plot: expected loss as a function of the point prediction]

R code:

mm <- 1      # mean of the predictive distribution
sd <- 2      # standard deviation of the predictive distribution
xx <- mm + seq(-2*sd, 2*sd, by=0.01)   # grid of candidate point predictions

sims <- rnorm(1e6, mm, sd)   # draws from the predictive distribution
alpha <- 1
loss <- sapply(xx, function(yy) mean((sims - yy)^2 - alpha*sims*yy))   # Monte Carlo expected loss
xx[which.min(loss)]          # point prediction minimizing the expected loss (1.5 here)

plot(xx, loss, type="l", xlab="Point prediction", las=1, ylab="Expected loss")
abline(v=mm, col="red")      # vertical line at the true mean

Stephan Kolassa
  • Yes, I'm aware that it will tend to give predictions that are higher (in absolute value) than the true value. However, I accept that, since for the same sign I don't really care how large the deviation is. I proposed another idea in a parallel topic (link), which you have already answered. – vladkkkkk Jul 11 '23 at 14:43