
I'm trying to detect anomalies in a dataset $i \in \{1,2,\dots,N\}$ where a random variable $y_i$ is expected to be drawn from a normal distribution with mean $\mu_i=0$ and variance $\sigma_i^2(X_i)$ determined entirely by (i.e., conditioned on) the features $X_i$.

My hope is that I can use a Z-score threshold such that anomalies are marked by:

$$\text{Anomalies} = \{\, i \in \{1,2,\dots,N\} \mid |y_i|/\sigma_i > Z_{\text{thresh}} \,\}$$
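In code, the thresholding step itself is simple (a minimal sketch; `y`, `sigma`, and `z_thresh` are placeholders, with `sigma` holding the per-observation standard deviations):

```python
import numpy as np

# Indices i with |y_i| / sigma_i > Z_thresh; y and sigma are
# length-N arrays, z_thresh a scalar such as 3.0.
anomalies = np.flatnonzero(np.abs(y) / sigma > z_thresh)
```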

I am wondering whether there is a "good" (e.g., maximum-likelihood) way to formulate this as a regression problem, in which any machine learning algorithm could be fit to $y_i^2$ given $X_i$ with a suitably chosen loss function. In that case, the predictions would presumably correspond to estimates of $\sigma_i^2$. But what loss function matches the probabilistic assumption that $y_i$ is drawn from $N(\mu_i=0, \sigma^2_i=f(X_i))$ for some function $f$?
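Spelled out, the assumption corresponds to the per-observation negative log-likelihood (a direct consequence of the zero-mean normal density, written here only to make the assumption concrete):

$$-\log p(y_i \mid X_i) = \frac{1}{2}\log(2\pi) + \frac{1}{2}\log \sigma_i^2 + \frac{y_i^2}{2\sigma_i^2},$$

which depends on $y_i$ only through $y_i^2$, so fitting the squared target at least appears consistent with the assumption.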

The inspiration for this question is that I have been using natural-gradient-based methods like NGBoost to fit $\mu_i$ and $\sigma_i$ simultaneously. I have also tried tree-based quantile regression methods. But since I only need to fit $\sigma_i$ here, it seems there should be a way to formulate fitting $\sigma_i$ as an ordinary regression problem with a suitable loss function.
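For reference, the NGBoost fit I mentioned looks roughly like this (a minimal sketch assuming the `ngboost` package's `NGBRegressor` API; `X` and `y` are placeholders):

```python
import numpy as np
from ngboost import NGBRegressor
from ngboost.distns import Normal

# Jointly fit mu_i and sigma_i by natural-gradient boosting.
model = NGBRegressor(Dist=Normal).fit(X, y)

# Per-observation fitted parameters of the Normal distribution.
dist = model.pred_dist(X)
mu, sigma = dist.params["loc"], dist.params["scale"]

# Z-score anomaly flags using the fitted parameters.
anomalies = np.flatnonzero(np.abs(y - mu) / sigma > 3.0)
```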

A similar question has a response stating that LinEx loss with a suitably chosen parameter yields a prediction corresponding to the sum of a mean and a variance, but I am looking for the variance alone. That question and others do not seem to assume that fitting is done on the reformulated target $y_i^2$, or do not ask about a suitable loss function for this case. An argument against fitting to $y_i^2$ would also be a helpful answer.


1 Answer


> But what loss function matches the probabilistic assumption that $y_i$ is drawn from $N(\mu_i=0, \sigma^2_i=f(X_i))$ for some function $f$?

The model is

$$ y_i=f(x_i;\theta)\varepsilon_i, \quad \varepsilon_i \stackrel{i.i.d.}{\sim} N(0,1). $$

I assume $f$ is known up to an unknown parameter (vector) $\theta$. Since the likelihood is normal, the corresponding loss function is quadratic. Nonlinearity of the model w.r.t. its parameter(s) $\theta$ does not affect this basic fact. Thus we should be able to estimate the parameter(s) $\theta$ by nonlinear least squares (as an alternative to doing this by maximum likelihood).
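As an illustration, here is a minimal maximum-likelihood sketch; the log-linear variance parametrization and the use of `scipy.optimize.minimize` are illustrative choices, not part of the question's setup:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_lik(theta, X, y):
    # Illustrative assumption: log sigma_i^2 = X_i . theta,
    # which keeps the fitted variance positive for any theta.
    log_var = X @ theta
    # Zero-mean Gaussian negative log-likelihood, constants dropped.
    return 0.5 * np.sum(log_var + y**2 / np.exp(log_var))

# Maximum-likelihood estimate of theta, then fitted sigma_i.
theta0 = np.zeros(X.shape[1])
res = minimize(neg_log_lik, theta0, args=(X, y), method="BFGS")
sigma_hat = np.sqrt(np.exp(X @ res.x))
```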

Richard Hardy
  • 67,272
  • I probably do not fully understand your question, but here is something. – Richard Hardy May 05 '23 at 10:38
  • Sorry for the late feedback. I think the model scaling a unit normal is interesting, but I'm unclear on what you mean by estimating the parameters by nonlinear least squares. I derived a correct result in the gradient boosting case but haven't made a post yet. – JoseOrtiz3 Jun 08 '23 at 18:17
  • @JoseOrtiz3, if you are not sure about NLS, just do maximum likelihood. (Gradient boosting is not an estimator, so it does not really compete with these two.) – Richard Hardy Jun 08 '23 at 18:40