
I am training a simple convolutional neural network for regression, where the task is to predict the (x,y) location of a box in an image, e.g.:

[Five example input images, each showing a single white box at a different (x, y) location]

The output of the network has two nodes, one for x and one for y. The rest of the network is a standard convolutional neural network. The loss is the standard mean squared error between the predicted position of the box and the ground-truth position. I am training on 10000 of these images, and validating on 2000.
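
For reference, the data can be reproduced with something along these lines (a minimal sketch, not my exact generation code; the image size, box size, and the normalisation of the targets to roughly [-1, 1] around the image centre are illustrative assumptions):

import numpy as np

image_width, image_height, box_size = 64, 64, 8  # illustrative sizes

def make_sample():
    # One black image (channels-first, as in the model below) with a white box.
    image = np.zeros((3, image_height, image_width), dtype=np.float32)
    x = np.random.randint(0, image_width - box_size)
    y = np.random.randint(0, image_height - box_size)
    image[:, y:y + box_size, x:x + box_size] = 1.0
    # Target: the box position relative to the image centre, scaled to roughly [-1, 1].
    tx = (x - image_width / 2.0) / (image_width / 2.0)
    ty = (y - image_height / 2.0) / (image_height / 2.0)
    return image, np.array([tx, ty], dtype=np.float32)

pairs = [make_sample() for _ in range(10000)]
images = np.stack([p[0] for p in pairs])
targets = np.stack([p[1] for p in pairs])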

The problem I am having is that even after significant training, the loss does not really decrease. Observing the output of the network, I notice that it tends to output values close to zero for both output nodes. As such, the predicted location of the box is always the centre of the image. There is some deviation in the predictions, but always around zero. The loss is shown below:

[Plot of the loss over training: it quickly flattens out and stops decreasing]

I have run this for many more epochs than shown in this graph, and the loss still never decreases. Interestingly, the loss actually increases at one point here.

So, it seems that the network is just predicting the average of the training data, rather than learning a good fit. Any ideas on why this may be? I am using Adam as the optimizer, with an initial learning rate of 0.01, and ReLUs as activations.
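
One quick way to check this (a sketch, using the model, images, and targets from the code below) is to compare the network's MSE with the MSE of a constant predictor that always outputs the mean training target:

import numpy as np

# MSE of a constant predictor that always outputs the mean target.
mean_target = targets.mean(axis=0)
baseline_mse = np.mean((targets - mean_target) ** 2)

# MSE of the trained network; if this matches the baseline, the
# network has effectively learned nothing beyond the mean.
model_mse = model.evaluate(images, targets, verbose=0)

print('baseline MSE:', baseline_mse, 'model MSE:', model_mse)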


If you are interested in some of my code (Keras), it is below:

# Imports (Keras 1 API)
from keras.models import Sequential
from keras.layers import Convolution2D, Flatten, Dense
from keras.optimizers import Adam

# Create the model: three strided conv layers, then two dense layers
model = Sequential()
model.add(Convolution2D(32, 5, 5, border_mode='same', subsample=(2, 2), activation='relu', input_shape=(3, image_width, image_height)))
model.add(Convolution2D(64, 5, 5, border_mode='same', subsample=(2, 2), activation='relu'))
model.add(Convolution2D(128, 5, 5, border_mode='same', subsample=(2, 2), activation='relu'))
model.add(Flatten())
model.add(Dense(100, activation='relu'))
model.add(Dense(2, activation='linear'))


# Compile the model with MSE loss and Adam
adam = Adam(lr=0.01, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)
model.compile(loss='mean_squared_error', optimizer=adam)


# Fit the model (plot_callback is a custom plotting callback defined elsewhere;
# validation_split=0.2 holds out the 2000 validation images)
model.fit(images, targets, batch_size=128, nb_epoch=1000, verbose=1, callbacks=[plot_callback], validation_split=0.2, shuffle=True)
Karnivaurus
  • Are the images on top examples of your actual samples? Is that 5 separate samples? There appears to be no information in the images that would help generalize. I mean, you don't need a neural net to find the x,y location of the white square, you can just parse the image and look for a white pixel. Explain a bit more about your vision for this model. Is there some temporal pattern, whereby you are predicting the next location? – photox Feb 14 '17 at 02:06
  • Hi, and yes, the images are 5 separate samples. I'm not sure how they are rendered for you, but they should be 5 individual square images (I've changed the layout a little to help...). Yes, I realise that you don't need a neural network for this task, but it is just a test experiment to help me learn how to do regression with a neural network. I don't understand what you mean by there being no information to help generalize... Each training pair consists of a square image, and a two-dimensional vector of the (x, y) location of the square. Thanks :) – Karnivaurus Feb 14 '17 at 11:57
  • Your input shape on the first conv layer is using 3 (RGB) channels, but your data are greyscale (1 channel). You don't need that many conv layers and filters; in fact, I think a single layer and a handful of small kernels will be fine (see the sketch after these comments). – photox Feb 14 '17 at 12:04
  • Are you sure that the images do indeed correspond to the targets? – user31264 Feb 14 '17 at 12:18
  • I know that I do not need 3 channels for this (the images I use are actually RGB), or so many layers, but I am just using this as a test case before applying it to more sophisticated images. – Karnivaurus Feb 14 '17 at 12:30
  • Yes, I have checked that the images and targets correspond, by drawing the target values on the images and displaying the images. – Karnivaurus Feb 14 '17 at 12:30
  • Like @photox says, you do not need the conv layers. Adding these makes it more difficult for the optimizer to find a good solution. If you remove the 3 conv layers, I suspect your "model" will work (sketched below). – Pieter Feb 14 '17 at 22:02
  • I can’t see any image. – SmallChess Oct 02 '17 at 06:02
  • Convolutional layers help with translational invariance due to weight sharing. This doesn't help you at all. As others have said before, you would get the result you expect without them. – Firebug May 16 '18 at 14:05
  • My guess would be a bug somewhere in the code we're not seeing. I implemented more or less what you had above using the newer tf.keras and it trains just fine: https://colab.research.google.com/drive/1mrCL2m8y50kd3WQA9IvtDl3-GwVku1gX?usp=sharing – James McKeown May 14 '21 at 20:01
  • Perhaps obvious, but a big lurking variable is image size and coordinate system. If I up from 10x10 images to 20x20 it takes way longer to train. If you have large images you'll likely need to downsample with some pooling layers. – James McKeown May 14 '21 at 20:09
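
Following up on the comments from photox and Pieter above, here are minimal sketches of the two simplifications they suggest (the layer sizes are illustrative guesses, written with the same Keras 1 API as the question, and image_width and image_height as in the question's code):

from keras.models import Sequential
from keras.layers import Convolution2D, Flatten, Dense

# photox's suggestion: a single conv layer with a handful of small kernels.
simple_cnn = Sequential()
simple_cnn.add(Convolution2D(8, 3, 3, border_mode='same', activation='relu',
                             input_shape=(3, image_width, image_height)))
simple_cnn.add(Flatten())
simple_cnn.add(Dense(2, activation='linear'))
simple_cnn.compile(loss='mean_squared_error', optimizer='adam')

# Pieter's suggestion: drop the conv layers entirely and regress
# straight from the flattened pixels.
dense_only = Sequential()
dense_only.add(Flatten(input_shape=(3, image_width, image_height)))
dense_only.add(Dense(100, activation='relu'))
dense_only.add(Dense(2, activation='linear'))
dense_only.compile(loss='mean_squared_error', optimizer='adam')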
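
And a sketch of James McKeown's last point: for larger images, pooling layers downsample the feature maps before the dense layers (again illustrative, with the pool sizes as guesses):

from keras.models import Sequential
from keras.layers import Convolution2D, MaxPooling2D, Flatten, Dense

model = Sequential()
model.add(Convolution2D(32, 5, 5, border_mode='same', activation='relu',
                        input_shape=(3, image_width, image_height)))
model.add(MaxPooling2D(pool_size=(2, 2)))  # halve the spatial resolution
model.add(Convolution2D(64, 5, 5, border_mode='same', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))  # halve it again
model.add(Flatten())
model.add(Dense(100, activation='relu'))
model.add(Dense(2, activation='linear'))
model.compile(loss='mean_squared_error', optimizer='adam')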