
ResNet consists of 25M trainable parameters. If only 30% of 600 $512 \times 512$ images are annotated, there are $600 \cdot 512 \cdot 512 \cdot 0.3 = 47,185,920$ ground truth pixels. A parameter is a floating point value of 32 bits, while an RGB colour pixel takes 24 bits. This means that the network is able to fully encode the whole ground truth into its parameters: $47,185,920 / 24 \cdot 32 = 62,914,560$, which is more than the number of ResNet's parameters. This leads to the network hardcoding its training data instead of inferring patterns and/or features, which does not bode well for its generalization.

Is this reasoning correct? And is this what happens when training with too little ground truth?

  • have a look here: https://stats.stackexchange.com/questions/329861/what-happens-when-a-model-is-having-more-parameters-than-training-samples – eugen Oct 04 '19 at 01:19

2 Answers


While it's true that one could, in principle, encode the training-set images in the weights, a direct comparison between the total number of bits in the weights and the total number of bits in the training images does not seem like a helpful way to think about capacity and memorization in deep neural networks (even if you also take into account the bits required to store the labels).

First, the fact that the weights have the theoretical capacity to memorize the dataset doesn't entail that the network can actually utilize that capacity. For example, an output neuron doesn't have access to the bit level of a first layer neuron's activation (unless no summation or pooling happens). So having as many weight-bits as dataset-bits isn't a sufficient condition for memorization by deep nets. You can imagine building a nearest-neighbour classifier that would fully memorize the dataset, but that doesn't mean a CNN can implement one.

Second, one can memorize the image-label mapping using far less information than the full training dataset. Let's assume that the first two pixels of an image fully identify it within the sample (a very reasonable assumption for 200 images). One can then build a simple lookup table from 200 48-bit keys to labels. This requires far fewer bits than memorizing the entire dataset: e.g. $(48+8) \cdot 200 = 11,200$ bits if we use 8-bit labels. Therefore, having as many weight-bits as dataset-bits isn't a necessary condition for memorization either.
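To make this concrete, here is a minimal sketch of such a lookup table in Python/NumPy; the 200 random images, the choice of key, and the 8-bit labels are purely illustrative assumptions:

    import numpy as np

    def first_two_pixels(img):
        # Two RGB pixels = 6 bytes = 48 bits, used as the lookup key.
        return img[0, :2].tobytes()

    # 200 random 512x512 RGB images with 8-bit labels (illustrative only).
    rng = np.random.default_rng(0)
    images = rng.integers(0, 256, size=(200, 512, 512, 3), dtype=np.uint8)
    labels = rng.integers(0, 256, size=200, dtype=np.uint8)

    # The table stores only (48 + 8) * 200 = 11,200 bits of mapping information,
    # yet it reproduces every training label exactly.
    table = {first_two_pixels(img): int(lab) for img, lab in zip(images, labels)}
    assert all(table[first_two_pixels(img)] == lab for img, lab in zip(images, labels))

Of course this says nothing about generalization; it only illustrates how cheap pure memorization of the mapping can be.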

So this condition is neither necessary nor sufficient.

A more practical test is to train the network on randomly assigned labels. If the network manages to achieve high training accuracy on this task, it means that it has sufficient capacity to memorize an arbitrary mapping from the training images to labels.
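Here is a minimal PyTorch sketch of that randomization test, in the spirit of the "fitting random labels" experiments; the model, the image tensor, and the hyperparameters are placeholders, and a real run would iterate over mini-batches rather than the whole set at once:

    import torch
    from torch import nn

    def random_label_training_accuracy(model, images, num_classes, epochs=50, lr=1e-3):
        # Assign a uniformly random label to every training image.
        targets = torch.randint(0, num_classes, (images.shape[0],))
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(epochs):
            optimizer.zero_grad()
            loss = loss_fn(model(images), targets)
            loss.backward()
            optimizer.step()
        # Training accuracy close to 1.0 means the network can memorize an
        # arbitrary image-to-label mapping, i.e. it has the capacity to overfit.
        with torch.no_grad():
            return (model(images).argmax(dim=1) == targets).float().mean().item()

For segmentation, one analogous check would be to shuffle the per-pixel label maps across images and see whether the network can still fit them.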

Trisoloriansunscreen
  • Thanks for your reply, very insightful! I'm doing semantic segmentation, in which case I don't think the second point holds! I tried overfitting on my dataset, and even after convergence the network still was not able to get a perfect score. So that reinforces your point that a CNN isn't capable of arranging all its weights in such a way that it encodes the entire dataset (that's what I got from it, at least). I still don't quite grasp the sentence "an output neuron doesn't have access to the bit level of a first layer neuron's activation". –  Oct 05 '19 at 21:49

There is redundancy in the training data, so you can overfit much more easily.

But on the other hand, not all of those bits are actually equally well used or trained. Much of the value range will literally never be used, and training cannot target individual bits of a weight. That is why many networks can afterwards be optimized to run faster using INT8 data types etc., nominally cutting the size down by a factor of 4.
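For example, here is a minimal sketch of post-training dynamic quantization in PyTorch, which stores the weights of the listed module types as INT8; the toy model is illustrative, and convolutional layers typically require the more involved static quantization workflow instead:

    import torch
    from torch import nn

    # Toy model standing in for a trained network.
    model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))

    # Weights of the nn.Linear modules are converted from 32-bit floats to
    # 8-bit integers, nominally a 4x size reduction for those layers.
    quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)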

Last but not least, there are various regularization terms that are intended to reduce overfitting. Some of the training patches may also contradict one another: there is likely an entirely black square in more than one class. There may also be some mislabeled data.
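As an illustration, two of the most common such regularizers in PyTorch are dropout in the model and L2 weight decay in the optimizer; the architecture and values below are placeholders:

    import torch
    from torch import nn

    # Dropout randomly zeroes activations during training, which discourages
    # the network from relying on (i.e. memorizing) individual features.
    model = nn.Sequential(
        nn.Linear(512, 256), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(256, 10)
    )

    # weight_decay adds an L2 penalty on the weights to every update.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)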

Nevertheless, there is supposedly still a lot of overfitting going on, both in the huge pretrained networks that you can download and, even more so, in the papers that spam all the AI conferences, the majority of which contribute nothing to advancing science because they merely show that you can overfit in many different ways... Think of the puppy-snail network; those images are artifacts of overfitting.