
I'm trying to detect anomalies in a dataset $i \in \{1,2,\dots,N\}$ where a random variable $y_i$ is expected to be drawn from a normal distribution with mean $\mu_i=0$ and variance $\sigma_i^2(X_i)$ determined entirely by (i.e., conditioned on) the features $X_i$.

My hope is that I can use a Z-score threshold such that anomalies are marked by:

$$\text{Anomalies} = \{\, i \in \{1,2,\dots,N\} \mid |y_i|/\sigma_i > Z_{\text{thresh}} \,\}$$
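In code, the thresholding step itself is simple (a minimal sketch; `y`, `sigma`, and `z_thresh` are placeholders, with `sigma` holding the per-observation standard deviations):

```python
import numpy as np

# Indices i with |y_i| / sigma_i > Z_thresh; y and sigma are
# length-N arrays, z_thresh a scalar such as 3.0.
anomalies = np.flatnonzero(np.abs(y) / sigma > z_thresh)
```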

I am wondering whether there is a "good" (e.g., maximum-likelihood) way to formulate this as a regression problem, in which any machine learning algorithm could be fit to $y_i^2$ given $X_i$ with a suitably chosen loss function. In that case, the predictions would presumably correspond to estimates of $\sigma_i^2$. But what loss function matches the probabilistic assumption that $y_i$ is drawn from $N(\mu_i=0, \sigma^2_i=f(X_i))$ for some function $f$?
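Spelled out, the assumption corresponds to the per-observation negative log-likelihood (a direct consequence of the zero-mean normal density, written here only to make the assumption concrete):

$$-\log p(y_i \mid X_i) = \frac{1}{2}\log(2\pi) + \frac{1}{2}\log \sigma_i^2 + \frac{y_i^2}{2\sigma_i^2},$$

which depends on $y_i$ only through $y_i^2$, so fitting the squared target at least appears consistent with the assumption.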

The inspiration for this question is that I have been using natural-gradient-based methods like NGBoost to fit $\mu_i$ and $\sigma_i$ simultaneously. I have also tried tree-based quantile regression methods. But since I only need to fit $\sigma_i$ here, it seems there should be a way to formulate fitting $\sigma_i$ as an ordinary regression problem with a suitable loss function.
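For reference, the NGBoost fit I mentioned looks roughly like this (a minimal sketch assuming the `ngboost` package's `NGBRegressor` API; `X` and `y` are placeholders):

```python
import numpy as np
from ngboost import NGBRegressor
from ngboost.distns import Normal

# Jointly fit mu_i and sigma_i by natural-gradient boosting.
model = NGBRegressor(Dist=Normal).fit(X, y)

# Per-observation fitted parameters of the Normal distribution.
dist = model.pred_dist(X)
mu, sigma = dist.params["loc"], dist.params["scale"]

# Z-score anomaly flags using the fitted parameters.
anomalies = np.flatnonzero(np.abs(y - mu) / sigma > 3.0)
```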

A similar question has a response stating that LinEx loss with a suitably chosen parameter yields a prediction corresponding to the sum of a mean and a variance, but I am looking for the variance alone. That question and others do not seem to assume that fitting is done on the reformulated target $y_i^2$, or do not ask about a suitable loss function for this case. An argument against fitting to $y_i^2$ would also be a helpful answer.


1 Answer


> But what loss function matches the probabilistic assumption that $y_i$ is drawn from $N(\mu_i=0, \sigma^2_i=f(X_i))$ for some function $f$?

The model is

$$ y_i=f(x_i;\theta)\varepsilon_i, \quad \varepsilon_i \stackrel{i.i.d.}{\sim} N(0,1). $$

I assume $f$ is known up to an unknown parameter (vector) $\theta$. Since the likelihood is normal, the corresponding loss function is quadratic. Nonlinearity of the model w.r.t. its parameter(s) $\theta$ does not affect this basic fact. Thus we should be able to estimate the parameter(s) $\theta$ by nonlinear least squares (as an alternative to doing this by maximum likelihood).
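As an illustration, here is a minimal maximum-likelihood sketch; the log-linear variance parametrization and the use of `scipy.optimize.minimize` are illustrative choices, not part of the question's setup:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_lik(theta, X, y):
    # Illustrative assumption: log sigma_i^2 = X_i . theta,
    # which keeps the fitted variance positive for any theta.
    log_var = X @ theta
    # Zero-mean Gaussian negative log-likelihood, constants dropped.
    return 0.5 * np.sum(log_var + y**2 / np.exp(log_var))

# Maximum-likelihood estimate of theta, then fitted sigma_i.
theta0 = np.zeros(X.shape[1])
res = minimize(neg_log_lik, theta0, args=(X, y), method="BFGS")
sigma_hat = np.sqrt(np.exp(X @ res.x))
```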

Richard Hardy
  • 67,272
  • I probably do not fully understand your question, but here is something. – Richard Hardy May 05 '23 at 10:38
  • Sorry for the late feedback. I think the model scaling a unit normal is interesting, but I'm unclear on what you mean by estimating the parameters by nonlinear least squares. I derived a correct result in the gradient boosting case but haven't made a post yet. – JoseOrtiz3 Jun 08 '23 at 18:17
  • @JoseOrtiz3, if you are not sure about NLS, just do maximum likelihood. (Gradient boosting is not an estimator, so it does not really compete with these two.) – Richard Hardy Jun 08 '23 at 18:40