
In Deep Learning, Goodfellow, Bengio, and Courville show that using MSE as a loss function gives the same result as maximizing the likelihood (MLE) of an assumed Gaussian model (Section 5.5.1):

$-m \log(\sigma) - \frac{m}{2}\log(2 \pi) - \sum_{i=1}^{m} \frac{||\hat{y}^{(i)} - y^{(i)} ||^2}{2 \sigma^2}$

where $m$ is the number of examples, $\hat{y}^{(i)}$ is the output of a model (presumably a neural network), and $y^{(i)}$ is the ground truth. This seems pretty clear to me. In PyTorch, I can implement the MSE loss function as

def custom_loss(outputs, targets):
    """ MSE, but written by me """
    return torch.mean(torch.pow(outputs - targets, 2))
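
A quick check against the built-in loss (my own addition; the shapes here are arbitrary) confirms the two agree:

outputs, targets = torch.randn(8, 1), torch.randn(8, 1)
assert torch.allclose(custom_loss(outputs, targets),
                      torch.nn.MSELoss(reduction='mean')(outputs, targets))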

This function works on a network with one output node. (There is a built-in MSE, but I want to walk before I run here.) With this loss function I can easily fit linear functions that are 'corrupted' by noise. Later, in Section 6.2.2.4, they write:

For example, we may wish to learn the variance of a conditional Gaussian for y, given x. In the simple case, where the variance $\sigma^2$ is a constant, there is a closed form expression because the maximum likelihood estimator of variance is simply the empirical mean of the squared difference between observations y and their expected value.
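
Taken in isolation, that closed-form estimator is a one-liner (this is my reading of the quote; outputs and targets are assumed names for the predictions and observations):

sigma2_hat = torch.mean(torch.pow(targets - outputs, 2))  # empirical mean squared residual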

Sure, but how do I implement this in PyTorch (my preferred neural network framework)? If I simply chuck it into the custom_loss function, the two terms are going to compete. Without much else to go on, I can try the following loss function:

def custom_loss(outputs, targets):
    """ MLE, but written by me """
    prec = 1 / torch.var(outputs)
    not_mse = 0.5 * prec * torch.mean(torch.pow(outputs - targets, 2))
    return not_mse - 0.5 * torch.log(prec)

This function simply uses the MLE formula from above (ignoring the constant term). Again, I have a single output node. Of course, this doesn't work:

[Two plots, "first failure" and "second failure", showing the network's output against the target line.]

The dashed line is the function to learn, the red circles are the function the network actually learned, and the blue/magenta dots are the training and test sets (they are hard to tell apart, but they are there). I have $x \in [-3, 3]$ and $y = mx + b$ with $m = 2.3$ and $b = 1.7$, to which I add Gaussian noise with $\sigma = 1$. I train on 10,000 points and test on 1,000. I arbitrarily chose a (1, 50, 50, 1) network with ReLU activations and bias nodes. This network easily fits the dashed line with the first custom_loss function. The text continues:

A computationally more expensive approach that does not require writing special-case code is to simply include the variance as one of the properties of the distribution $p(\mathbf{y} \mid \mathbf{x})$ that is controlled by $\boldsymbol{\omega} = f(\mathbf{x};\boldsymbol{\theta})$. The negative log-likelihood $-\log p(\mathbf{y};\boldsymbol{\omega}(\mathbf{x}))$ will then provide a cost function with the appropriate terms necessary to make our optimization procedure incrementally learn the variance.

First, I don't know what a "propert[y] of the distribution" is. But this really just looks like the first equation (multiplied by $-1$ to minimize a loss instead of maximizing a likelihood), I've already used it above, and it doesn't work. And it is obvious that it shouldn't work, because of the competition between the two terms.

In the simple case where the standard deviation does not depend on the input, we can make a new parameter in the network that is copied directly into $\boldsymbol{\omega}$.

I have no idea what this means, nor what to do with it. I just pick a weight and say "this is the variance" and off the MSE optimizer goes to optimize things? This is supervised learning, how do I pick a target for $\sigma$? The text makes it seem like I don't have to.

At this point the text veers into computational issues in choosing among precision, variance, and standard deviation. Clearly, the authors think their task of explaining how to fit the variance is done and only implementation details remain, but I have no good idea what to do.

How do I learn the variance of my Gaussian model using a neural network?


Some code that might make your life easier when answering.

A PyTorch dataset class that will make a line with noise:

# imports needed by all the snippets below (added for completeness)
import numpy as np
import matplotlib.pyplot as plt
import torch
from typing import List, Tuple, Type


class RegDataset(torch.utils.data.Dataset):
    def __init__(self, m: float, b: float, stdev: float = 1,
                 limits: List[float] = [-1, 1], n: int = 1000):
        self.m = float(m)
        self.b = float(b)
        self.stdev = abs(stdev)
        self.limits = self._check_limits(limits)
        self.n = int(n)
        self.dtype = torch.float32
        self.X = self.make_x()
        self.Y = self.add_noise(self.line(self.X))

    @staticmethod
    def _check_limits(limits: List[float]) -> List[float]:
        assert len(limits) == 2, 'limits must be a list of two numerics'
        assert limits[0] != limits[1], 'limits must not be equal'
        if limits[0] > limits[1]:
            return [limits[1], limits[0]]
        return limits

    def make_x(self) -> torch.Tensor:
        u = torch.rand((self.n, 1), dtype=self.dtype)
        return (self.limits[1] - self.limits[0]) * u + self.limits[0]

    def line(self, x: torch.Tensor) -> torch.Tensor:
        return self.m * x + self.b
        # return self.m * torch.pow(x, 3) + self.b

    def add_noise(self, y: torch.Tensor) -> torch.Tensor:
        return y + torch.randn((y.shape[0], 1), dtype=self.dtype) * self.stdev

    def make_line(self) -> Tuple[torch.Tensor, torch.Tensor]:
        x = torch.linspace(self.limits[0], self.limits[1], 100)
        return x, self.line(x)

    def compute_mse(self) -> float:
        return torch.pow(self.Y - self.line(self.X), 2).sum() / self.n

    def __len__(self) -> int:
        return len(self.X)

    def __getitem__(self, idx) -> Tuple[torch.Tensor, torch.Tensor]:
        return self.X[idx], self.Y[idx]

A PyTorch class that will make an MLP network:

class MLP(torch.nn.Module):
    """
    A simple feed-forward network (MLP) for regression.
    """
    def __init__(self,
                 input_size: int,
                 output_size: int = 1,
                 layers: Tuple[int, ...] = (64,),
                 dropout_fraction: float = 0.5,
                 include_bias: bool = True
                 ):
        """
        A neural network for regression.

        Parameters
        ----------
        input_size : int
            The size of the input layer
        output_size : int
            The size of the output layer, default is 1
        layers : tuple
            The MLP feed forward network shape: unit counts for each layer
        dropout_fraction : float
            The fraction of connections that should be dropped during training
        include_bias : bool
            Should the linear layers have a bias?
        """
        super().__init__()
        self.input_size = input_size
        self.output_size = output_size
        self.layers = layers
        self.dropout_fraction = dropout_fraction
        self.bias = include_bias

        # required so pytorch can find the parameters
        self.mlp = torch.nn.ModuleList()
        self.dropouts = torch.nn.ModuleList()

        for idx, l in enumerate(self.layers):
            if idx == 0:
                last_layer_numbers = self.input_size
            else:
                last_layer_numbers = self.mlp[idx - 1].out_features

            self.mlp.append(torch.nn.Linear(last_layer_numbers, l,
                                            bias=self.bias))

            # no dropout after the last hidden layer
            if idx < len(self.layers) - 1:
                self.dropouts.append(torch.nn.Dropout(p=self.dropout_fraction))
            else:
                self.dropouts.append(None)
        self.mlp.append(torch.nn.Linear(self.layers[-1], self.output_size,
                                        bias=False))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Forward prop on the neural network.
        """
        # ReLU (and optional dropout) on every hidden layer,
        # linear output on the last layer
        for layer, dropout in zip(self.mlp[:-1], self.dropouts):
            x = layer(x)
            x = torch.nn.functional.relu(x)
            if dropout:
                x = dropout(x)
        x = self.mlp[-1](x)
        return x

An "environment" to do training and testing:

class MLPEnvironment:
    """
    A training and testing harness for an MLP.
    """
    def __init__(self,
                 train_loader: torch.utils.data.DataLoader,
                 test_loader: torch.utils.data.DataLoader,
                 net: MLP,
                 optimizer: Type[torch.optim.Optimizer] = None,
                 loss_function: Type[torch.nn.MSELoss] = None,
                 n_epochs: int = 3,
                 learning_rate: float = 0.01,
                 momentum: float = 0.5,
                 log_interval: int = 50,
                 random_seed: int = 17
                 ):
        """
        Environment for training a neural network.

        Parameters
        ----------
        train_loader: torch.utils.data.DataLoader
            The source of the training data.
        test_loader: torch.utils.data.DataLoader
            The source of the test data.
        net: MLP
            The network to train / test.
        optimizer: Type[torch.optim.Optimizer]
            The optimizer to use for training.  Default is Adam.
        loss_function: Type[torch.nn.MSELoss]
            Loss function to use.  Default is MSE.
        n_epochs: int
            How many epochs to train on top of any previous training.
        learning_rate: float
            The learning rate.
        momentum: float
            Learning momentum.
        log_interval: int
            How often to log learning progress.  Number of batches.
        random_seed: int
            Seed for the random number generator.
        """
        self.train_loader = train_loader
        self.test_loader = test_loader
        self.net = net
        self.n_epochs = n_epochs
        self.trained_epochs = 0
        self.learning_rate = learning_rate
        self.momentum = momentum
        if optimizer is None:
            self.optimizer = torch.optim.Adam(
                self.net.parameters(),
                lr=self.learning_rate
            )
        else:
            self.optimizer = optimizer
        if loss_function is None:
            self.loss_function = torch.nn.MSELoss(reduction='mean')
        else:
            self.loss_function = loss_function
        self.log_interval = log_interval
        self.random_seed = random_seed

        # turn off cuDNN (note: this disables cuDNN, not CUDA)
        torch.backends.cudnn.enabled = False
        # de-randomize
        torch.manual_seed(self.random_seed)

        self.train_losses = []
        self.train_counter = []
        self.test_losses = []
        self.test_counter = []
        self.loss = None

    def train_net(self, new_epochs: int = 0, verbose=True) -> None:
        if new_epochs > 0:
            n_epochs = new_epochs
        else:
            n_epochs = self.n_epochs  # run the default
        start_epoch = self.trained_epochs + 1
        for epoch in range(start_epoch, start_epoch + n_epochs):
            self.train(epoch, verbose=verbose)
            self.test(verbose=verbose)

    def train(self, epoch: int, verbose=True) -> None:
        num_train = len(self.train_loader.dataset)
        train_seen = 0
        self.net.train()  # set the network to train mode
        for batch_idx, (data, target) in enumerate(self.train_loader):
            data = data.float()
            target = target.float()

            self.optimizer.zero_grad()  # reset gradients
            output = self.net.forward(data)  # compute predictions
            self.loss = self.loss_function(output, target)  # compute loss
            self.loss.backward()  # backprop
            self.optimizer.step()  # training step

            train_seen += len(data)

            if batch_idx % self.log_interval == 0:
                if verbose:
                    print(
                        'Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.2e}'
                            .format(epoch, train_seen, num_train,
                                    100. * train_seen / num_train,
                                    self.loss.item()
                                    )
                    )
                self.train_losses.append(self.loss.item())
                self.train_counter.append(
                    train_seen + (epoch - 1) * num_train
                )
        self.trained_epochs += 1

    def test(self, verbose=True) -> None:
        self.net.eval()  # set the network to evaluation mode
        test_loss = 0
        number_test_batches = 0
        with torch.no_grad():  # don't compute gradients
            for data, target in self.test_loader:
                data = data.float()
                target = target.float()
                output = self.net.forward(data)
                test_loss += self.loss_function(output, target).item()
                number_test_batches += 1
        test_loss /= number_test_batches
        self.test_losses.append(test_loss)
        self.test_counter.append(self.trained_epochs *
                                 len(self.train_loader.dataset))
        if verbose:
            print(
                '\nTest set: Avg. loss: {:.2e}\n'.format(test_loss)
            )

    def plot_performance(self, logx: bool = False) -> None:
        plt.figure()
        plt.semilogy([u / 1000 for u in self.train_counter], self.train_losses,
                     color='blue')
        plt.semilogy([u / 1000 for u in self.test_counter], self.test_losses,
                     color='red', linestyle='', marker='o')
        plt.legend(['Train Loss', 'Test Loss'], loc='upper right')
        plt.xlabel('number of training examples seen (1000s)')
        plt.ylabel('{}'.format(repr(self.loss_function)[:-2]))
        if logx:
            ax = plt.gca()
            ax.set_xscale('log')
        plt.show()

    def count_weights(self) -> int:
        """ counts the number of weights trained in the network """
        count = 0
        for p in self.net.mlp.parameters():
            if len(p.shape) == 2:  # skip bias vectors (1-d parameters)
                count += np.prod(p.shape)
        return count

All of these make constructing the problem and playing around easy:

m = 2.3
b = 1.7
limits = [-3.0, 3.0]
stdev = 1.0
n_train = 10000
n_test = 1000
batch_size = 500

def custom_loss(outputs, targets): """ MLE, but written by me """ prec = 1 / torch.var(outputs) not_mse = 0.5 * prec * torch.mean(torch.pow(outputs - targets, 2)) return not_mse - 0.5 * torch.log(prec)

train_dataset = RegDataset(m, b, stdev, limits, n_train)
test_dataset = RegDataset(m, b, stdev, limits, n_test)

train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=False)
test_dataloader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

nnet = MLP(1, 1, (50, 50), dropout_fraction=0.0)

MLPEnv = MLPEnvironment(train_dataloader, test_dataloader, nnet, loss_function=custom_loss, n_epochs=1, log_interval=1)

print(nnet.mlp)
print('Number of weights to train: {:d}'.format(MLPEnv.count_weights()))
print('NOTE: pytorch does not count bias weights for some reason.')

Then train and plot:

for _ in range(10):
    MLPEnv.train_net(verbose=True)

MLPEnv.plot_performance(logx=True)

  • With $\omega$ they simply refer to an output neuron that does not depend on the input features; it's just a constant. However, it's a constant that still needs to be learned by the network. One way to do this is to make a single learnable weight and copy it across all rows of your batch, then optimize that along with your main mean-prediction neuron (which, as usual, depends on all the features). A sketch follows below. – Georg M. Goerg Aug 18 '23 at 03:36
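
A minimal sketch of what that comment describes (the class name and shapes are my own invention):

class ConstantLogSigma(torch.nn.Module):
    """ a single learnable constant for log(sigma), shared by every input """
    def __init__(self):
        super().__init__()
        self.log_sigma = torch.nn.Parameter(torch.zeros(1))

    def forward(self, batch_size: int) -> torch.Tensor:
        # copy the one learned value across all rows of the batch
        return self.log_sigma.expand(batch_size, 1)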

1 Answer


How do I learn the variance of my Gaussian model using a neural network?

The problems you outline arise because you're not estimating the variance of the target distribution; you're estimating the variance of the predictions.

You need a network that has two outputs, $\hat y$ and $ \hat \sigma$.

Then use $$ \sum_{i=1}^{m} \left[ \frac{\left(\hat{y}_i - y_i \right)^2}{2 \hat \sigma_i^2} + \log(\hat \sigma_i) \right] $$ as the loss function, taking three arguments: $\hat \sigma_i, \hat y_i, y_i$.

(If each $y_i$ is a vector, then you'll generalize to the multivariate Gaussian case, which is straightforward to derive from the multivariate normal pdf.)

You'll either need to enforce that $\hat \sigma_i$ is positive or parameterize your loss (and model) in terms of $\gamma_i = \log \hat \sigma_i$ (which can be either positive or negative).
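
In PyTorch this loss is only a few lines. Here is a minimal sketch using the log-$\sigma$ parameterization (the function name is mine, and I average over the batch rather than sum, which only rescales the gradient):

def gaussian_nll(mu, log_sigma, targets):
    """ negative log-likelihood of N(mu, sigma^2), dropping the constant
        0.5*log(2*pi) term; since log_sigma may be any real number,
        sigma = exp(log_sigma) stays positive automatically """
    inv_var = torch.exp(-2.0 * log_sigma)  # 1 / sigma^2
    return torch.mean(0.5 * inv_var * torch.pow(mu - targets, 2) + log_sigma)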

I just pick a weight and say "this is the variance" and off the MSE optimizer goes to optimize things?

Not a weight, an output of the model. But yeah, one of the outputs is $\hat \sigma_i$ and one of the outputs is $\hat y_i$.

This is supervised learning, how do I pick a target for $\sigma$?

You don't need targets for $\sigma$ because the negative log-likelihood does it all. If $\hat \sigma_i$ gets too large, then $\log \hat \sigma_i$ gets larger. If $\hat \sigma_i$ gets too small, then the MSE term gets too large. You want to find a compromise value that minimizes the loss function, so you optimize the loss.
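
Concretely, with the asker's classes this can be wired up as follows (a sketch under my own conventions: output column 0 is $\hat y_i$, column 1 is $\log \hat \sigma_i$; the wrapper matches the two-argument loss signature MLPEnvironment expects):

nnet = MLP(1, 2, (50, 50), dropout_fraction=0.0)  # two output nodes now

def nll_loss(outputs, targets):
    """ split the two output columns and apply the Gaussian NLL """
    mu, log_sigma = outputs[:, :1], outputs[:, 1:]
    return gaussian_nll(mu, log_sigma, targets)

# after training, the learned noise level is read straight off the network:
# sigma_hat = torch.exp(nnet(x)[:, 1:])

Recent versions of PyTorch also ship this as a built-in, torch.nn.GaussianNLLLoss, which takes the variance $\hat \sigma_i^2$ (not $\log \hat \sigma_i$) as its third argument.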

See also: Improvement in NN regressor by Negative Log Likelihood loss vs MSE loss

– Sycorax
  • Where does that loss expression come from? Can it be generalized to learn to predict the parameters of non-Gaussian distributions as well? – KarelPeeters Jan 29 '23 at 23:40
  • It's derived from the pdf of a normal distribution. See: https://stats.stackexchange.com/q/378274/22311 – Sycorax Jan 30 '23 at 00:56
  • 1
    @ToddSewell yes that's the power of distribution learning! I suggest you take a look at pyro https://github.com/pyro-ppl/pyro ( or tensorflow-probability for equivalent in tf / keras) – Georg M. Goerg Aug 18 '23 at 03:33