
Quick question about using maximum likelihood as the loss function for a neural network.

In the regression case, the assumption is that the data is corrupted by Gaussian noise... and intuitively, the wider the noise, the more uncertain the prediction should be.

In "normal" model, we assume homoskedasticity, and the variance becomes just the residuals of the predictions... but I'm not entirely satisfied by this formulation of the problem (I don't like assumptions)

So, by using the MLE we get the MSE as follows (writing $m = |\text{dataset}|$ and dropping the constants that do not depend on $\theta$):
$$
\begin{align*}
\theta &= \underset{\theta}{\operatorname{argmin}} \; -\sum_{i=1}^{m} \log p\left(y^{(i)}\mid x^{(i)};\theta\right)\\
&= \underset{\theta}{\operatorname{argmin}} \; \sum_{i=1}^{m} \left(y^{(i)}-\hat{y}^{(i)}\right)^2
\end{align*}
$$
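
A quick numerical sanity check of this equivalence (not from the original post; the toy arrays are made up): with a fixed $\sigma$, the Gaussian negative log-likelihood is just the sum of squared errors scaled by $1/(2\sigma^2)$ plus a constant, so the two objectives share the same minimizer.

import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=100)          # toy targets
y_hat = np.linspace(-1, 1, 100)   # toy predictions
sigma = 1.5                       # fixed (homoskedastic) noise scale

# Gaussian NLL with constant sigma
nll = np.sum(0.5 * np.log(2 * np.pi * sigma**2) + (y - y_hat)**2 / (2 * sigma**2))
# sum of squared errors, rescaled and shifted by the constant term
sse = np.sum((y - y_hat)**2)
const = len(y) * 0.5 * np.log(2 * np.pi * sigma**2)
assert np.isclose(nll, sse / (2 * sigma**2) + const)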

We dropped $\sigma$ since it is a constant... but what if we let our NN predict it as well, switching the distribution from $Y \sim N(\text{model}(x, \theta), \sigma^2)$ to $Y \sim N(\text{model}(x, \theta)_1, \text{model}(x, \theta)_2)$? That allows us to write the following (dropping the constant $\frac{m}{2}\log(2\pi)$ along the way):
$$
\begin{align*}
\theta &= \underset{\theta}{\operatorname{argmin}} \; -\sum_{i=1}^{m} \log p\left(y^{(i)}\mid x^{(i)};\theta\right)\\
&= \underset{\theta}{\operatorname{argmin}} \; -\sum_{i=1}^{m} \log\left(\frac{1}{\sqrt{2\pi{\hat\sigma^{(i)}}^2}}\exp\left(-\frac{\left(y^{(i)}-\hat{y}^{(i)}\right)^2}{2{\hat\sigma^{(i)}}^2}\right)\right)\\
&= \underset{\theta}{\operatorname{argmin}} \; -\sum_{i=1}^{m} \left[\log\frac{1}{\sqrt{2\pi{\hat\sigma^{(i)}}^2}} + \log\exp\left(-\frac{\left(y^{(i)}-\hat{y}^{(i)}\right)^2}{2{\hat\sigma^{(i)}}^2}\right)\right]\\
&= \underset{\theta}{\operatorname{argmin}} \; -\sum_{i=1}^{m} \left[-\log{\hat\sigma^{(i)}} - \frac{\left(y^{(i)}-\hat{y}^{(i)}\right)^2}{2{\hat\sigma^{(i)}}^2}\right]\\
&= \underset{\theta}{\operatorname{argmin}} \; \sum_{i=1}^{m} \left[\frac{\left(y^{(i)}-\hat{y}^{(i)}\right)^2}{2{\hat\sigma^{(i)}}^2} + \log{\hat\sigma^{(i)}}\right]
\end{align*}
$$
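
The last line translates directly into a loss function. A minimal standalone sketch in TensorFlow (the function name and the epsilon guard are mine, not a library API):

import tensorflow as tf

def heteroskedastic_nll(y, mu, sigma, eps=1e-10):
    # per-sample Gaussian NLL up to the constant (1/2) log(2 pi):
    # (y - mu)^2 / (2 sigma^2) + log(sigma), averaged over the batch
    return tf.reduce_mean(
        tf.math.square(y - mu) / (2.0 * tf.math.square(sigma) + eps)
        + tf.math.log(sigma + eps)
    )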

Thus my question... am I missing something, or should this also predict the "uncertainty" of the prediction? If the homoskedasticity assumption were correct, the model would just predict the same variance each time; homoskedasticity is then just a "particular case" of this reasoning.
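
To make the "particular case" explicit (this step is not in the original post, just standard MLE algebra): if the network outputs one shared $\hat\sigma$ for all samples, minimizing the loss above over $\hat\sigma$ recovers exactly the residual-variance estimate of the homoskedastic model:
$$
\frac{\partial}{\partial \hat\sigma}\left[\frac{1}{2\hat\sigma^2}\sum_{i=1}^m \left(y^{(i)}-\hat y^{(i)}\right)^2 + m\log\hat\sigma\right] = 0 \quad\Longrightarrow\quad \hat\sigma^2 = \frac{1}{m}\sum_{i=1}^m \left(y^{(i)}-\hat y^{(i)}\right)^2
$$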

Maybe it would be better for the model to predict $\log(\sigma)$, to avoid the positivity constraint on its output (a "domain" problem)... but that should just be a matter of replacing ${\hat\sigma^{(i)}}^2$ in the steps above with $e^{2\log(\hat\sigma^{(i)})}$.
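
With that substitution the per-sample loss becomes $\frac{1}{2}\left(y^{(i)}-\hat y^{(i)}\right)^2 e^{-2 s^{(i)}} + s^{(i)}$ with $s^{(i)} = \log\hat\sigma^{(i)}$, which needs neither a positivity activation nor epsilon guards. A minimal sketch (my naming, not from the post):

import tensorflow as tf

def heteroskedastic_nll_logsigma(y, mu, log_sigma):
    # substituting sigma = exp(log_sigma) into the loss above:
    # (y - mu)^2 / (2 exp(2 log_sigma)) + log_sigma
    return tf.reduce_mean(
        0.5 * tf.math.square(y - mu) * tf.math.exp(-2.0 * log_sigma)
        + log_sigma
    )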

Update:
I tried to implement this in TensorFlow with the following model:

import tensorflow as tf
from keras.layers.core import Activation
from keras.utils.generic_utils import get_custom_objects

# shifted ELU: strictly positive output, suitable for predicting sigma
def custom_activation(x):
    return tf.keras.activations.elu(x, alpha=1.0) + 1

get_custom_objects().update({'custom_activation': Activation(custom_activation)})

class Sampling(tf.keras.layers.Layer):
    # reparameterization trick: draw y ~ N(mu, sigma^2) as mu + eps * sigma
    def call(self, inputs):
        z_mean, z_sigma = inputs
        return z_mean + tf.keras.backend.random_normal((z_mean.shape[-1],)) * z_sigma

mod_input = tf.keras.layers.Input(shape=(4,))
N = 10
mod = tf.keras.layers.Dense(N, "selu")(mod_input)
mod = tf.keras.layers.Dense(N, "selu")(mod)
mod = tf.keras.layers.Dense(N, "selu")(mod)

# two heads on a shared trunk: one for the mean, one for the standard deviation
pre_mu = tf.keras.layers.Dense(N, "selu")(mod)
pre_mu = tf.keras.layers.Dense(N, "selu")(pre_mu)
pre_sigma = tf.keras.layers.Dense(N, "selu")(mod)
pre_sigma = tf.keras.layers.Dense(N, "selu")(pre_sigma)

mu = tf.keras.layers.Dense(1, "linear")(pre_mu)
sigma = tf.keras.layers.Dense(1, "custom_activation")(pre_sigma)

output = Sampling()([mu, sigma])

model = tf.keras.Model(mod_input, [mu, sigma, output], name="model")

class MyModel(tf.keras.Model):
    def __init__(self, model, alpha, **kwargs):
        super(MyModel, self).__init__(**kwargs)
        self.model = model
        self.alpha = alpha

    def train_step(self, data):
        x, y = data
        y = tf.reshape(y, y.shape + (1,))
        with tf.GradientTape() as tape:
            mu, sigma, out = self.model(x)
            # negative log-likelihood from the derivation above:
            # (y - y_hat)^2 / (2 sigma^2) + log(sigma), averaged over the batch;
            # note: the squared error uses the sampled output `out` as y_hat, not `mu` directly
            total_loss = tf.math.reduce_mean(
                tf.divide(
                    tf.math.square(out - y),
                    2 * (sigma ** 2) + 1e-10
                )
            ) + tf.math.reduce_mean(
                tf.math.log(sigma + 1e-10)
            )
        grads = tape.gradient(total_loss, self.trainable_weights)
        self.optimizer.apply_gradients(zip(grads, self.trainable_weights))
        return {
            "loss": total_loss,
        }

mymodel = MyModel(model, alpha=1)
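
For completeness, a minimal way to train it (the optimizer choice and the X_train / y_train names are placeholders, not from the original post; as the comments below conclude, it can take many epochs before $\mu$ becomes correct):

mymodel.compile(optimizer=tf.keras.optimizers.Adam())
mymodel.fit(X_train, y_train, epochs=500)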

However, it has some problems... in particular, the model relies entirely on the variance to explain the targets, rather than on the mean prediction (plot omitted).

So at this point, introducing a prior over the variances helps reduce this, but there is no clearly good choice (for example, a Gaussian prior with a constant weight in front helps a lot, but then there is not much correlation between the predicted sigma and the NN's error).
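
As a sketch of what such a penalty could look like (this is my reading of "a Gaussian prior with a constant in front", so treat it as an assumption): a zero-mean Gaussian prior on $\sigma$ adds an $\alpha\sigma^2$ term to the negative log-likelihood.

import tensorflow as tf

def nll_with_sigma_prior(y, mu, sigma, alpha, eps=1e-10):
    # heteroskedastic NLL plus a zero-mean Gaussian log-prior on sigma,
    # which shows up as an alpha * sigma^2 penalty
    nll = tf.reduce_mean(
        tf.math.square(y - mu) / (2.0 * tf.math.square(sigma) + eps)
        + tf.math.log(sigma + eps)
    )
    return nll + alpha * tf.reduce_mean(tf.math.square(sigma))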


Edit:
As requested: in my opinion, a model that predicts the variance in addition to the mean makes more sense, because it tells you how much you can rely on each prediction.

As for the code, the problem I see is a lack of correlation between the error the NN makes and the variance it predicts; to me it makes sense that there should be some correlation, since a larger variance means a larger (probability-wise) error.
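
One hypothetical way to quantify that (X and y here stand for the training data; this diagnostic is my addition, not part of the original post): compare the predicted sigma against the realized absolute error.

import numpy as np

mu, sigma, _ = model.predict(X)
# correlation between predicted sigma and actual absolute error;
# under a well-calibrated heteroskedastic model this should be clearly positive
print(np.corrcoef(np.abs(y - mu.ravel()), sigma.ravel())[0, 1])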

Alberto
  • You might find tensorflow-probability a useful toolset for this problem (see the sketch after these comments). – Demetri Pananos Jun 26 '22 at 16:22
  • Of course you could estimate an expectation and a variance component for each observation, and your derivation is sound. Whether or not it's "better" is context-dependent. Can you edit your question to explain what would make a model better or worse? Keep in mind this modeling decision, as with many others, involves a bias-variance tradeoff, so whether or not a more complex model is a sound choice will be context-dependent. – Sycorax Jun 26 '22 at 18:05
  • Also, I think you have a typo here: "If our assumption of heteroskedasticity was correct, then the model will just predict the same variance each time, but it's just a 'particular case' of this reasoning." Heteroskedasticity refers to having non-constant variance, so I think you mean to refer to homoskedasticity. – Sycorax Jun 26 '22 at 18:07
  • @Sycorax yep, my bad, sorry – Alberto Jun 26 '22 at 19:54
  • @Sycorax in the answer you linked, I think you are assuming homoskedasticity, since you bring the predicted variance out of the sum (i.e., it's independent of the sample)... but at that point you could just have predicted the mean and used the MLE for the variance, afaik, whereas here we have a "sigma" for each sample. – Alberto Jun 26 '22 at 19:57
  • @Sycorax I've edited the post, but I didn't get what you mean by the bias-variance trade-off... what I'm seeing is neither an underfitting nor an overfitting problem, just the model relying on the wrong component to make the prediction (the same model trained using MSE predicts the data perfectly). – Alberto Jun 26 '22 at 20:05
  • @Sycorax please retract your close vote on the question so that I can put a bounty on it, since I'm really interested in whether people can spot the issue. – Alberto Jun 26 '22 at 20:06
  • The answer that I linked to has $\hat \sigma_i$ for each sample $i$. // The only question I see in your post is "Thus my question... am I missing something, or should this also predict the 'uncertainty' of the prediction?" which, no, you're not: you have a likelihood for heteroskedasticity, so everything works. And you seem to agree that it is correct. What's unclear? When you say "spot the issue," are you asking people to debug your code? Because that is not an on-topic question. – Sycorax Jun 26 '22 at 20:09
  • @Sycorax if you have a $\hat\sigma$ for each prediction, how are you able to bring $m\log(\hat\sigma)$ out of the summation? – Alberto Jun 26 '22 at 20:10
  • @Sycorax my question is: why on earth does my model rely only on the variance to explain the prediction, and not the expected value? – Alberto Jun 26 '22 at 20:11
  • There doesn't seem to be any reason to believe that you've found a minimum of the loss function. – Sycorax Jun 26 '22 at 20:26
  • @Sycorax I'm so dumb... I reached the same conclusion 2 minutes ago, and since I was tired of this problem, I let the model train for 500 epochs... and the $\mu$ was then correct. I love you ahaha. Please let me know if I can bounty you some points for the help. – Alberto Jun 26 '22 at 20:29
  • No need to throw any points my way. As for "dumb," I wouldn't sweat it. Neural networks are complicated machines, with lots of dials to turn, and learning how to diagnose them is a lot of work. We have two big threads on debugging them: (1) https://stats.stackexchange.com/questions/352036/what-should-i-do-when-my-neural-network-doesnt-learn/352037#352037 (2) https://stats.stackexchange.com/questions/365778/what-should-i-do-when-my-neural-network-doesnt-generalize-well Debugging is mostly checking and re-checking each step of your process (from data to code to execution) whenever you get stuck. – Sycorax Jun 26 '22 at 20:35
  • @Sycorax yeah, absolutely, but I'm going to get a degree in Data Science in a year and I'm pretty confident, yet every now and then I run into these problems, and when it turns out that the problem was just the number of epochs, I feel so dumb ahah. – Alberto Jun 26 '22 at 22:00
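
Following up on the tensorflow-probability suggestion above, a minimal sketch of the same objective written with tfp.distributions (the wiring is illustrative, not the author's code):

import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions

def gaussian_nll(y, mu, sigma):
    # same objective as the derivation above: -log N(y; mu, sigma^2), batch-averaged
    return -tf.reduce_mean(tfd.Normal(loc=mu, scale=sigma).log_prob(y))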
