
In machine learning, loss is usually defined over the actual output and the predicted output, $L(Y,\hat{Y}(X))$, while in statistics it is defined over the parameter space, $L(\theta,\hat{\theta}(X))$. Why? I assume one reason is that in this case statistics assumes only parametric models, while the ML loss is more general and covers both parametric and non-parametric cases. Is there any other reason?

1 Answer


> In machine learning, loss is usually defined over the actual output and the predicted output, $L(Y,\hat{Y}(X))$, while in statistics it is defined over the parameter space, $L(\theta,\hat{\theta}(X))$.

That's not quite right. Loss is always defined by comparing a prediction from a candidate model to the target in the data:

$$L(Y, \hat{Y})$$

Sometimes our statistical model is defined by some small(ish) number of parameters $\hat \theta$ (*), which allows us to express $\hat Y$ as a function of the data $X$ and the proposed parameters $\hat \theta$:

$$\hat{Y} = \hat{Y}(X, \hat \theta)$$

which makes the loss a function of the target, the input data, and the parameters:

$$L(Y, \hat{Y}) = L(Y, \hat{Y}(X, \hat \theta))$$
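To make this concrete, here is a minimal sketch (the function names `predict` and `loss` are my own, and the linear model and squared-error loss are illustrative choices) showing how the loss depends on the target $Y$, the data $X$, and the proposed parameters $\hat \theta$, while the true parameter never appears in the computation:

```python
import numpy as np

def predict(X, theta_hat):
    # Y_hat = Y_hat(X, theta_hat) for a linear model (illustrative choice)
    return X @ theta_hat

def loss(Y, Y_hat):
    # L(Y, Y_hat): mean squared error between target and prediction
    return np.mean((Y - Y_hat) ** 2)

# Toy data: Y is generated from a "true" parameter that, in practice,
# we never observe; only X and Y are available to us.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
theta_true = np.array([1.0, -2.0])      # unknown in any real problem
Y = X @ theta_true + rng.normal(scale=0.1, size=100)

theta_hat = np.array([0.9, -1.9])       # a proposed estimate
print(loss(Y, predict(X, theta_hat)))   # L(Y, Y_hat(X, theta_hat))
```

Note that evaluating the loss requires only $Y$, $X$, and $\hat \theta$; `theta_true` is used here solely to simulate the data.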

Notice that I did not write $L(\theta,\hat{\theta}(X))$; that would require knowing the true value of the parameter $\theta$, which you never do. You only ever have access to $Y$ and $\hat Y$.

As an addendum:

> I assume one reason is that in this case statistics assumes only parametric models, while the ML loss is more general and covers both parametric and non-parametric cases.

That's not a fair characterization of statistics. Statisticians are just as interested in non-parametric models as machine learning researchers, and have arguably been studying them for just as long.

(*) I'm following the notation from the question here. $\hat \theta$ is usually used for the estimated value of the parameter $\theta$, but the poster used the symbol $\theta$ for the true value of the parameter.

Matthew Drury