Quick question related to maximizing the likelihood as a loss function for a NN.
In the case of regression, by assumption the data is corrupted by Gaussian noise... however, the wider the noise, the more uncertain the prediction should be.
In the "normal" model we assume homoskedasticity, and the variance is simply estimated from the residuals of the predictions... but I'm not entirely satisfied with this formulation of the problem (I don't like assumptions).
So, by using the MLE we get the MSE as follows (writing $\hat{y}^{(i)} = model(x^{(i)}, \theta)$ and $N$ for the size of the dataset):
$$
\begin{align*}
\theta &= \underset{\theta}{\operatorname{argmin}} -\sum_{i=1}^{N} \log p(y^{(i)}\mid x^{(i)};\theta)\\
&= \underset{\theta}{\operatorname{argmin}} \sum_{i=1}^{N} (y^{(i)}-\hat{y}^{(i)})^2
\end{align*}
$$
We have removed $\sigma$ since it is a constant... but what if we let our NN predict it as well, switching the distribution from $Y \sim N(model(x, \theta), \sigma^2)$ to $Y \sim N(model(x, \theta)_1, model(x, \theta)_2)$? Dropping the constant $\frac{1}{2}\log(2\pi)$ along the way, this gives:
$$
\begin{align*}
\theta &= \underset{\theta}{\operatorname{argmin}} -\sum_{i=1}^{N} \log p(y^{(i)}\mid x^{(i)};\theta)\\
&= \underset{\theta}{\operatorname{argmin}} -\sum_{i=1}^{N} \log\left(\sqrt{\frac{1}{2\pi{\hat\sigma^{(i)}}^2}}\exp\left(-\frac{(y^{(i)}-\hat{y}^{(i)})^2}{2{\hat\sigma^{(i)}}^2}\right)\right)\\
&= \underset{\theta}{\operatorname{argmin}} -\sum_{i=1}^{N} \left[\log\left(\sqrt{\frac{1}{2\pi{\hat\sigma^{(i)}}^2}}\right) + \log\left(\exp\left(-\frac{(y^{(i)}-\hat{y}^{(i)})^2}{2{\hat\sigma^{(i)}}^2}\right)\right)\right]\\
&= \underset{\theta}{\operatorname{argmin}} -\sum_{i=1}^{N} \left[\log\left(\exp\left(-\frac{(y^{(i)}-\hat{y}^{(i)})^2}{2{\hat\sigma^{(i)}}^2}\right)\right) - \log\left(\hat\sigma^{(i)}\right)\right]\\
&= \underset{\theta}{\operatorname{argmin}} -\sum_{i=1}^{N} \left[-\frac{(y^{(i)}-\hat{y}^{(i)})^2}{2{\hat\sigma^{(i)}}^2} - \log\left(\hat\sigma^{(i)}\right)\right]\\
&= \underset{\theta}{\operatorname{argmin}} \sum_{i=1}^{N} \left[\frac{(y^{(i)}-\hat{y}^{(i)})^2}{2{\hat\sigma^{(i)}}^2} + \log\left(\hat\sigma^{(i)}\right)\right]
\end{align*}
$$
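To make the final objective concrete, this is roughly how I would write it as a TensorFlow loss (the name gaussian_nll and the small epsilon are just my own choices for the sketch, averaging over the batch instead of summing):

import tensorflow as tf

def gaussian_nll(y_true, mu, sigma, eps=1e-6):
    # Per-sample negative log-likelihood of y_true under N(mu, sigma^2),
    # up to the additive constant 0.5*log(2*pi) dropped in the derivation.
    sigma = tf.maximum(sigma, eps)  # keep sigma strictly positive
    return tf.reduce_mean(
        tf.square(y_true - mu) / (2.0 * tf.square(sigma)) + tf.math.log(sigma)
    )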
Thus my question... am I missing something, or should this also predict the "uncertainty" of the prediction? If the homoskedasticity assumption were actually correct, the model would simply learn to predict the same variance every time, so constant variance is just a "particular case" of this reasoning.
Maybe it would be better for the model to predict $\log(\hat\sigma)$ in order to avoid the "domain" problem (the output would otherwise have to be constrained to be positive)... but that should just be a matter of replacing ${\hat\sigma}^2$ in the steps above with $e^{2\log(\hat\sigma)}$.
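In code, that reparameterization would look roughly like this (again, gaussian_nll_logsigma and its signature are just my own sketch, assuming the network's second output is interpreted as $\log\hat\sigma$):

import tensorflow as tf

def gaussian_nll_logsigma(y_true, mu, log_sigma):
    # Same loss as above, but parameterized by log(sigma), so no positivity
    # constraint is needed on the network output.
    inv_var = tf.exp(-2.0 * log_sigma)  # 1/sigma^2 = e^{-2 log(sigma)}
    return tf.reduce_mean(
        0.5 * tf.square(y_true - mu) * inv_var + log_sigma
    )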
Update:
I tried to implement this using TensorFlow with the following model:
import tensorflow as tf
from keras.layers.core import Activation
from keras.utils.generic_utils import get_custom_objects

# Shifted ELU so that the predicted sigma stays strictly positive
def custom_activation(x):
    return tf.keras.activations.elu(x, alpha=1.0) + 1

get_custom_objects().update({'custom_activation': Activation(custom_activation)})

class Sampling(tf.keras.layers.Layer):
    # Draws a sample around the predicted mean using the predicted scale
    def call(self, inputs):
        z_mean, z_var = inputs
        return z_mean + tf.keras.backend.random_normal((z_mean.shape[-1],)) * z_var / 2

mod_input = tf.keras.layers.Input(shape=(4,))
N = 10

# Shared trunk
mod = tf.keras.layers.Dense(N, "selu")(mod_input)
mod = tf.keras.layers.Dense(N, "selu")(mod)
mod = tf.keras.layers.Dense(N, "selu")(mod)

# Separate heads for the mean and the standard deviation
pre_mu = tf.keras.layers.Dense(N, "selu")(mod)
pre_mu = tf.keras.layers.Dense(N, "selu")(pre_mu)
pre_sigma = tf.keras.layers.Dense(N, "selu")(mod)
pre_sigma = tf.keras.layers.Dense(N, "selu")(pre_sigma)

mu = tf.keras.layers.Dense(1, "linear")(pre_mu)
sigma = tf.keras.layers.Dense(1, "custom_activation")(pre_sigma)
output = Sampling()([mu, sigma])

model = tf.keras.Model(mod_input, [mu, sigma, output], name="model")

class MyModel(tf.keras.Model):
    def __init__(self, model, alpha, **kwargs):
        super(MyModel, self).__init__(**kwargs)
        self.model = model
        self.alpha = alpha

    def train_step(self, data):
        x, y = data
        y = tf.reshape(y, y.shape + (1,))
        with tf.GradientTape() as tape:
            mu, sigma, out = self.model(x)
            # Negative log-likelihood from the derivation above
            # (small epsilons added for numerical stability)
            total_loss = tf.math.reduce_mean(
                tf.divide(
                    tf.math.square(out - y),
                    (2 * (sigma ** 2) + 1e-10)
                )
            ) + tf.reduce_mean(
                tf.math.log(sigma + 1e-10)
            )
        grads = tape.gradient(total_loss, self.trainable_weights)
        self.optimizer.apply_gradients(zip(grads, self.trainable_weights))
        return {
            "loss": total_loss,
        }

mymodel = MyModel(model, alpha=1)
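For reference, this is roughly how it can be trained; the synthetic heteroskedastic dataset, the optimizer, and the hyperparameters below are just placeholders of mine, not the actual data I used:

import numpy as np

# Toy heteroskedastic data: 4 input features, noise level depending on x[:, 0]
# (purely illustrative; any regression dataset with 4 features would do)
X = np.random.uniform(-1, 1, size=(5000, 4)).astype("float32")
noise_scale = 0.1 + 0.5 * np.abs(X[:, 0])
y = (X.sum(axis=1) + np.random.normal(0, noise_scale)).astype("float32")

# run_eagerly keeps the custom train_step simple to debug
mymodel.compile(optimizer=tf.keras.optimizers.Adam(1e-3), run_eagerly=True)
mymodel.fit(X, y, epochs=50, batch_size=64)

# At prediction time, mu is the point estimate and sigma its uncertainty
mu_pred, sigma_pred, _ = mymodel.model.predict(X)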
However, it has some problems... in particular, the model relies almost entirely on the predicted variance to explain the targets, instead of on the predicted mean.

So at this point, introducing a prior over the variances helps reduce this, but there is no obviously good choice (for example, introducing a Gaussian prior with a constant weight in front helps a lot, but then there is not much correlation between the predicted sigma and the NN error).
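To be concrete about what I mean by a Gaussian prior with a constant in front: something roughly like the penalty below added to the loss (the function name, sigma_prior, and alpha are arbitrary choices of mine for the sketch):

import tensorflow as tf

def gaussian_prior_penalty(sigma, sigma_prior=1.0, alpha=1.0):
    # Quadratic (Gaussian, up to constants) penalty pulling the predicted
    # sigmas towards a prior scale sigma_prior, weighted by the constant alpha.
    return alpha * tf.reduce_mean(tf.square(sigma - sigma_prior))

# In train_step this would be added to the loss, e.g.:
# total_loss = total_loss + gaussian_prior_penalty(sigma, sigma_prior=1.0, alpha=self.alpha)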
Edit:
As requested: in my opinion, a model that predicts the variance in addition to the mean makes more sense, since you then know how much you can rely on each prediction.
As for the code, in my opinion the problem is the lack of correlation between the error the NN is making and the predicted variance; to me it makes more sense that there should be some sort of correlation, since a larger variance means a larger (probability-wise) error.
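This is roughly how I would check that (lack of) correlation; X_val and y_val stand for some hypothetical held-out data:

import numpy as np

# Diagnostic: on held-out data, a well-calibrated sigma should correlate
# with the absolute error of mu. X_val and y_val are hypothetical arrays.
mu_val, sigma_val, _ = mymodel.model.predict(X_val)
abs_err = np.abs(y_val.reshape(-1, 1) - mu_val)
corr = np.corrcoef(abs_err.ravel(), sigma_val.ravel())[0, 1]
print("corr(|error|, predicted sigma) =", corr)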
