
I am training a simple convolutional neural network for regression, where the task is to predict the (x,y) location of a box in an image, e.g.:

[Five example input images, each showing a single white box at a different (x, y) location]

The output of the network has two nodes, one for x and one for y. The rest of the network is a standard convolutional neural network. The loss is the standard mean squared error between the predicted position of the box and the ground-truth position. I am training on 10000 of these images, and validating on 2000.
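
For reference, the data can be reproduced with something along these lines (a minimal sketch, not my exact generation code; the image size, box size, and the normalisation of the targets to roughly [-1, 1] around the image centre are illustrative assumptions):

import numpy as np

image_width, image_height, box_size = 64, 64, 8  # illustrative sizes

def make_sample():
    # One black image (channels-first, as in the model below) with a white box.
    image = np.zeros((3, image_height, image_width), dtype=np.float32)
    x = np.random.randint(0, image_width - box_size)
    y = np.random.randint(0, image_height - box_size)
    image[:, y:y + box_size, x:x + box_size] = 1.0
    # Target: the box position relative to the image centre, scaled to roughly [-1, 1].
    tx = (x - image_width / 2.0) / (image_width / 2.0)
    ty = (y - image_height / 2.0) / (image_height / 2.0)
    return image, np.array([tx, ty], dtype=np.float32)

pairs = [make_sample() for _ in range(10000)]
images = np.stack([p[0] for p in pairs])
targets = np.stack([p[1] for p in pairs])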

The problem I am having is that even after significant training, the loss does not really decrease. Observing the output of the network, I notice that it tends to output values close to zero for both output nodes. As such, the predicted location of the box is always the centre of the image. There is some deviation in the predictions, but always around zero. The loss is shown below:

[Plot of the loss over training: it quickly flattens out and stops decreasing]

I have run this for many more epochs than shown in this graph, and the loss still never decreases. Interestingly, the loss actually increases at one point here.

So, it seems that the network is just predicting the average of the training data, rather than learning a good fit. Any ideas on why this may be? I am using Adam as the optimizer, with an initial learning rate of 0.01, and ReLUs as activations.
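
One quick way to check this (a sketch, using the model, images, and targets from the code below) is to compare the network's MSE with the MSE of a constant predictor that always outputs the mean training target:

import numpy as np

# MSE of a constant predictor that always outputs the mean target.
mean_target = targets.mean(axis=0)
baseline_mse = np.mean((targets - mean_target) ** 2)

# MSE of the trained network; if this matches the baseline, the
# network has effectively learned nothing beyond the mean.
model_mse = model.evaluate(images, targets, verbose=0)

print('baseline MSE:', baseline_mse, 'model MSE:', model_mse)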


If you are interested in some of my code (Keras), it is below:

# Imports (Keras 1 API)
from keras.models import Sequential
from keras.layers import Convolution2D, Flatten, Dense
from keras.optimizers import Adam

# Create the model: three strided conv layers, then two dense layers
model = Sequential()
model.add(Convolution2D(32, 5, 5, border_mode='same', subsample=(2, 2), activation='relu', input_shape=(3, image_width, image_height)))
model.add(Convolution2D(64, 5, 5, border_mode='same', subsample=(2, 2), activation='relu'))
model.add(Convolution2D(128, 5, 5, border_mode='same', subsample=(2, 2), activation='relu'))
model.add(Flatten())
model.add(Dense(100, activation='relu'))
model.add(Dense(2, activation='linear'))


# Compile the model with MSE loss and Adam
adam = Adam(lr=0.01, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)
model.compile(loss='mean_squared_error', optimizer=adam)


# Fit the model (plot_callback is a custom plotting callback defined elsewhere;
# validation_split=0.2 holds out the 2000 validation images)
model.fit(images, targets, batch_size=128, nb_epoch=1000, verbose=1, callbacks=[plot_callback], validation_split=0.2, shuffle=True)
Karnivaurus
  • Are the images on top examples of your actual samples? Is that 5 separate samples? There appears to be no information in the images that would help generalize. I mean, you don't need a neural net to find the x,y location of the white square, you can just parse the image and look for a white pixel. Explain a bit more about your vision for this model. Is there some temporal pattern, whereby you are predicting the next location? – photox Feb 14 '17 at 02:06
  • Hi, and yes, the images are 5 separate samples. I'm not sure how they are rendered for you, but they should be 5 individual square images (I've changed the layout a little to help...). Yes, I realise that you don't need a neural network for this task, but it is just a test experiment to help me learn how to do regression with a neural network. I don't understand what you mean by there being no information to help generalize... Each training pair consists of a square image, and a two-dimensional vector of the (x, y) location of the square. Thanks :) – Karnivaurus Feb 14 '17 at 11:57
  • Your input shape on the first conv layer is using 3 (RGB) channels, but your data are greyscale (1 channel). You don't need that many conv layers and filters; in fact, I think a single layer and a handful of small kernels will be fine (see the sketch after these comments). – photox Feb 14 '17 at 12:04
  • Are you sure that the images do indeed correspond to the targets? – user31264 Feb 14 '17 at 12:18
  • I know that I do not need 3 channels for this (the images I use are actually RGB), or so many layers, but I am just using this as a test case before applying it to more sophisticated images. – Karnivaurus Feb 14 '17 at 12:30
  • Yes, I have checked that the images and targets correspond, by drawing the target values on the images and displaying the images. – Karnivaurus Feb 14 '17 at 12:30
  • Like @photox says, you do not need the conv layers. Adding these makes it more difficult for the optimizer to find a good solution. If you remove the 3 conv layers, I suspect your "model" will work (sketched below). – Pieter Feb 14 '17 at 22:02
  • I can’t see any image. – SmallChess Oct 02 '17 at 06:02
  • Convolutional layers help with translational invariance due to weight sharing. This doesn't help you at all. As others have said before, you would get the result you expect without them. – Firebug May 16 '18 at 14:05
  • My guess would be a bug somewhere in the code we're not seeing. I implemented more or less what you had above using the newer tf.keras and it trains just fine: https://colab.research.google.com/drive/1mrCL2m8y50kd3WQA9IvtDl3-GwVku1gX?usp=sharing – James McKeown May 14 '21 at 20:01
  • Perhaps obvious, but a big lurking variable is image size and coordinate system. If I up from 10x10 images to 20x20 it takes way longer to train. If you have large images you'll likely need to downsample with some pooling layers. – James McKeown May 14 '21 at 20:09
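
Following up on the comments from photox and Pieter above, here are minimal sketches of the two simplifications they suggest (the layer sizes are illustrative guesses, written with the same Keras 1 API as the question, and image_width and image_height as in the question's code):

from keras.models import Sequential
from keras.layers import Convolution2D, Flatten, Dense

# photox's suggestion: a single conv layer with a handful of small kernels.
simple_cnn = Sequential()
simple_cnn.add(Convolution2D(8, 3, 3, border_mode='same', activation='relu',
                             input_shape=(3, image_width, image_height)))
simple_cnn.add(Flatten())
simple_cnn.add(Dense(2, activation='linear'))
simple_cnn.compile(loss='mean_squared_error', optimizer='adam')

# Pieter's suggestion: drop the conv layers entirely and regress
# straight from the flattened pixels.
dense_only = Sequential()
dense_only.add(Flatten(input_shape=(3, image_width, image_height)))
dense_only.add(Dense(100, activation='relu'))
dense_only.add(Dense(2, activation='linear'))
dense_only.compile(loss='mean_squared_error', optimizer='adam')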
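
And a sketch of James McKeown's last point: for larger images, pooling layers downsample the feature maps before the dense layers (again illustrative, with the pool sizes as guesses):

from keras.models import Sequential
from keras.layers import Convolution2D, MaxPooling2D, Flatten, Dense

model = Sequential()
model.add(Convolution2D(32, 5, 5, border_mode='same', activation='relu',
                        input_shape=(3, image_width, image_height)))
model.add(MaxPooling2D(pool_size=(2, 2)))  # halve the spatial resolution
model.add(Convolution2D(64, 5, 5, border_mode='same', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))  # halve it again
model.add(Flatten())
model.add(Dense(100, activation='relu'))
model.add(Dense(2, activation='linear'))
model.compile(loss='mean_squared_error', optimizer='adam')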