
I'm trying to wrap my head around a loss surface in PyTorch. This is for work, not a homework assignment.

Let's say we have a model:

y = model(x)
error = y - y_label

The simplest of loss functions, absolute error:

   error.abs().mean().backward()

Looks like this: [plot of the absolute-error loss surface]

The "industry standard" loss function looks like this

(error * error).mean().backward() # error.pow(2) also works

[plot of the squared-error loss surface]

In my mind, two things are happening here:

  1. The errors are being weighted by the magnitude of the errors (sketched below)
  2. The error function is now nonlinear, and the gradient is dependent on the error squared
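
To make item 1 concrete, here is a minimal sketch (the toy values are my own, not from any real model) comparing the gradient each loss sends back for a small and a large error:

    import torch

    # Toy setup: one prediction with a small error and one with a large error
    y_label = torch.tensor([0.0, 0.0])
    y = torch.tensor([0.1, 3.0], requires_grad=True)

    # Absolute error: every sample pushes back with the same magnitude (1/n),
    # no matter how big its error is
    error = y - y_label
    l1 = error.abs().mean()
    print(torch.autograd.grad(l1, y)[0])   # tensor([0.5000, 0.5000])

    # Squared error: each sample's gradient is scaled by its own error,
    # so the large error dominates the update
    error = y - y_label                    # rebuild the graph for a second pass
    l2 = (error * error).mean()
    print(torch.autograd.grad(l2, y)[0])   # tensor([0.1000, 3.0000])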

So my question is: Can anyone tell me (like I'm a 5 year old) what their intuition is about the difference between mse_loss and the function below?

(error * error.detach()).mean().backward()

1 Answer


Using your notation, the error $e$ is the difference between the model's predictions $\hat y$ and the true target $y$. $$ \texttt{error} = e = \hat{y} - y $$

The MSE loss function $$ L_\text{MSE} = (\hat y -y)^2 $$ has gradient $$ \frac{ \partial L_\text{MSE} }{ \partial \hat y} = 2(\hat y - y). $$

The proposed loss function

(error * error.detach()).mean().backward()

is not MSE because error.detach() returns the same values as error but detached from the autograd graph. This means that, from the perspective of autograd, error.detach() is a constant.
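
A quick way to see this (a minimal sketch; the numbers are arbitrary):

    import torch

    y = torch.tensor([2.0], requires_grad=True)
    y_label = torch.tensor([5.0])

    error = y - y_label        # lives in the autograd graph
    const = error.detach()     # same values, but cut out of the graph

    print(error.requires_grad)        # True  -- gradients flow through it
    print(const.requires_grad)        # False -- autograd treats it as a fixed number
    print(torch.equal(error, const))  # True  -- numerically identical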

So from the perspective of autograd, we can write the loss as

$$ L_\text{proposed} = (\hat y - y)e $$

where $e$ is a constant that just happens to have the exact value $\hat y - y$. This has gradient $$ \frac{\partial L_\text{proposed}}{\partial \hat y} = e=\hat{y} - y $$ which is not the same as MSE, so using $L_\text{proposed}$ will not produce the same updates, or yield the same model as MSE.
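
If it helps to check this numerically (a minimal sketch; the toy numbers are arbitrary), the gradient of the proposed loss comes out to exactly half of the MSE gradient, i.e. the same direction at a different scale:

    import torch

    y_label = torch.tensor([1.0, -2.0, 0.5])
    y = torch.tensor([0.3, 1.0, 2.0], requires_grad=True)

    # MSE: gradient is 2 * error / n
    error = y - y_label
    grad_mse = torch.autograd.grad((error * error).mean(), y)[0]

    # Proposed loss: the detached factor is a constant, so the gradient is error / n
    error = y - y_label
    grad_proposed = torch.autograd.grad((error * error.detach()).mean(), y)[0]

    print(grad_mse)                                     # tensor([-0.4667,  2.0000,  1.0000])
    print(grad_proposed)                                # tensor([-0.2333,  1.0000,  0.5000])
    print(torch.allclose(grad_mse, 2 * grad_proposed))  # True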

Sycorax
  • Worth noting that this one will actually misbehave due to signs—contrast the gradient with the gradient of MSE. (Also need to indicate in the final equation LHS that it’s a gradient.) – Arya McCarthy Apr 28 '22 at 01:22
  • Thank you, so just to really swing home what you're saying. Will the proposed loss function always result in 1/2 of the mse gradient? – Roderick Obrist Apr 28 '22 at 05:52
  • Yes, that’s what happens when you divide by 2. – Sycorax Apr 28 '22 at 12:04