While it's true that one could encode the training-set images in the weights, directly comparing the total number of bits in the weights with the total number of bits in the training images is not a helpful way to think about capacity and memorization in deep neural networks (even if you also account for the bits required to store the labels).
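For concreteness, here is a rough sketch of the bit counting this comparison involves. The parameter count, image size, and label width are made-up illustrative numbers, not figures from the question:

```python
# Back-of-the-envelope bit counts (illustrative numbers only: a small CNN
# with ~1M float32 parameters, and 200 32x32 RGB images with 8-bit labels).
n_params = 1_000_000
weight_bits = n_params * 32                # float32 weights

n_images = 200
image_bits = n_images * 32 * 32 * 3 * 8    # 8 bits per channel
label_bits = n_images * 8                  # 8-bit labels

print(f"weight bits:  {weight_bits:,}")                # 32,000,000
print(f"dataset bits: {image_bits + label_bits:,}")    # 4,916,800
```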
First, the fact that the weights have the theoretical capacity to store the dataset doesn't entail that the network architecture can actually exploit that capacity. For example, an output neuron has no access to the individual bits of a first-layer neuron's activation (unless no summation or pooling happens along the way). So having as many weight-bits as dataset-bits isn't a sufficient condition for memorization by deep nets. You can imagine building a nearest-neighbor classifier that fully memorizes the dataset, but that doesn't mean a CNN can implement it.
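A minimal sketch of such a nearest-neighbor memorizer, on synthetic stand-in data (the shapes and label count here are arbitrary); it reaches 100% training accuracy trivially, by storing the dataset verbatim:

```python
import numpy as np

# A trivial 1-nearest-neighbor "memorizer": it stores the training set
# as-is, so it classifies every training image correctly by construction.
rng = np.random.default_rng(0)
X_train = rng.integers(0, 256, size=(200, 32 * 32 * 3)).astype(np.float32)
y_train = rng.integers(0, 10, size=200)

def predict_1nn(x):
    # Return the label of the closest stored image (Euclidean distance).
    dists = np.linalg.norm(X_train - x, axis=1)
    return y_train[np.argmin(dists)]

train_preds = np.array([predict_1nn(x) for x in X_train])
print("training accuracy:", (train_preds == y_train).mean())  # 1.0
```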
Second, one can memorize the image-label mapping using far less information than the full training dataset. Suppose the first two pixels suffice to uniquely identify each image in the dataset (a very reasonable assumption for 200 images). One can then build a simple lookup table from 200 48-bit keys (2 RGB pixels × 3 channels × 8 bits) to labels. This requires far fewer bits than memorizing the entire dataset: (48 + 8) × 200 = 11,200 bits if we use 8-bit labels. Therefore, having as many weight-bits as dataset-bits isn't a necessary condition for memorization either.
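A minimal sketch of this lookup table, again on synthetic stand-in data, assuming 8-bit RGB pixels so that two pixels yield a 48-bit key:

```python
import numpy as np

# Lookup-table memorization: key each image by its first two RGB pixels
# (2 pixels x 3 channels x 8 bits = 48 bits), and map the key to the label.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(200, 32, 32, 3), dtype=np.uint8)
labels = rng.integers(0, 10, size=200)

def key48(img):
    # Pack the 6 leading bytes (the first two RGB pixels) into one 48-bit int.
    first_six = img.reshape(-1)[:6]
    return int.from_bytes(first_six.tobytes(), "big")

table = {key48(img): int(lbl) for img, lbl in zip(images, labels)}

# With 200 random images, 48-bit key collisions are vanishingly unlikely,
# so the table recovers every training label exactly.
assert all(table[key48(img)] == lbl for img, lbl in zip(images, labels))
print("entries:", len(table), "~ storage:", len(table) * (48 + 8), "bits")
```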
So this condition is neither necessary nor sufficient.
A more practical test is to train the network on randomly assigned labels (the randomization test of Zhang et al., "Understanding deep learning requires rethinking generalization", 2017). If the network manages to achieve good training accuracy on this task, it has sufficient capacity to memorize an arbitrary mapping from the training images to labels.
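A minimal sketch of this test in PyTorch, using random tensors as stand-ins for the training images; the architecture, optimizer, and epoch count are arbitrary choices, and whether the network actually fits the random labels depends on its capacity and the training budget:

```python
import torch
from torch import nn

# Randomization test: assign labels at random, then check whether the
# network can still fit the training set. Synthetic stand-in data here;
# in practice you would plug in your real training images.
torch.manual_seed(0)
X = torch.rand(200, 3, 32, 32)             # stand-in "images"
y_random = torch.randint(0, 10, (200,))    # labels assigned at random

model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.Linear(64 * 8 * 8, 10),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Full-batch training: 200 samples is small enough to fit in one batch.
for epoch in range(500):
    opt.zero_grad()
    loss = loss_fn(model(X), y_random)
    loss.backward()
    opt.step()

with torch.no_grad():
    acc = (model(X).argmax(1) == y_random).float().mean()
print(f"training accuracy on random labels: {acc:.2f}")
```

High training accuracy here says nothing about generalization, of course; it only demonstrates that the network has enough effective capacity to memorize the training set.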