Fundamental questions on CNN and MLP in general

Question

I have read a number of tutorials and online lectures (link 1: https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/ and link 2, the web tutorial: http://cs231n.github.io/convolutional-networks/#overview), but none of them explain the rationale for selecting a particular design. How do we decide on the following design aspects? In particular, the following points are unclear to me, and I would be extremely grateful if somebody could walk through the concepts that would help me design a CNN architecture.

1) If the filter size is 5*5*3, how can the number of filters be $K=12$? Shouldn't it be 3 -- one for each channel?

2) In the second link, there is a formula: $((W-F+2P)/S)+1$,
where $W$ = width of the input image and $F$ = filter size. It is not clear to me what the purpose of this formula is.

3) Is there a rule of thumb/formula for deciding on the number of layers, number of filters, filter size, number of fully connected layers? Or is it purely on the basis of trial and error?

4) Can somebody please explain the intuition and the rationale for designing a CNN architecture for this example -- considering a binary classification problem. For an input RGB image of size 500*500*3, how would you design the architecture -- how many layers, number of filters, size of the filter, how much is the stride, etc.

you have asked a lot of questions. try to break them up into separate posts. in that way you may get better answers. — Haitao Du, Sep 16 '18 at 08:09
Possible duplicate of Is building deep learning architectures a trial and error scheme? — Jan Kukacka, Sep 16 '18 at 08:14
I think 3 and 4 have already answers here and in linked threads. Other than that, please always post a single question per thread. — Jan Kukacka, Sep 16 '18 at 08:15
The formula in 2 computes the width of the output of a convolutional layer with input width W, filter size F, padding P and stride S. — Jan Kukacka, Sep 16 '18 at 08:17
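To make the comment above concrete, here is a minimal sketch of the formula $((W-F+2P)/S)+1$ as a Python function. The function name and argument names are my own, and rejecting settings where the division is not exact is an assumption (it follows the convention in the cs231n notes; many frameworks round down instead).

```python
def conv_output_size(input_size, filter_size, padding=0, stride=1):
    """Output width of a conv layer: ((W - F + 2P) / S) + 1."""
    numerator = input_size - filter_size + 2 * padding
    if numerator < 0:
        raise ValueError("filter is larger than the padded input")
    if numerator % stride != 0:
        # Assumption: treat a non-integer result as an invalid setting.
        raise ValueError("filter does not tile the input evenly for this stride")
    return numerator // stride + 1


# W=32, F=5, P=0, S=1 (the 32x32 image with a 5x5 filter from the question)
print(conv_output_size(32, 5))                       # 28
# W=5, F=3, P=1, S=2
print(conv_output_size(5, 3, padding=1, stride=2))   # 3
```

So the formula only says how large each output feature map is; it says nothing about how many filters the layer has.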

score 2 · Accepted Answer · answered Sep 16 '18 at 18:31

1) I'm not sure where you found the $5 \times 5 \times 3$. Maybe it's the size of the input image. In that case the input image does have 3 channels, but the number of filters in a layer is a free choice. Let's say we select $K=12$ filters. Then the output of this layer will have $12$ channels (one feature map per filter), and the next layer will see $12$ channels as its input.
2) This formula computes the width of a convolutional layer's output, given the input width $W$, filter size $F$, padding $P$ and stride $S$.
3) No, unfortunately there is no such rule of thumb. Also, the well-established state-of-the-art networks don't follow any common architectural recipe (e.g. see Inception, ResNet, VGG; they are all very different architecturally). If you want to create your own, I'd suggest taking an established network (e.g. ResNet-50) and tweaking it a bit.
4) Same as before. One thing you could try is the pretty basic scheme of [conv (+relu) -> conv (+relu) -> max pool], repeated 2-3 times, -> flatten -> fc (+relu) -> dropout -> fc (2 neurons and softmax activation). This is a decent model which doesn't take much memory and runs relatively fast (compared to more complex CNNs), but it can't reach state-of-the-art performance. If you want better results, stick with fine-tuning a pre-trained state-of-the-art model.
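As a rough illustration of the basic scheme above applied to the 500*500*3 input from the question, the sketch below traces the tensor shapes layer by layer. Everything specific in it is an assumption chosen for illustration, not a recommendation from the answer: 3x3 convolutions with padding 1 and stride 1, 2x2 max pooling with stride 2 (rounding down), three blocks with 16, 32 and 64 filters, and a hidden fully connected layer of 128 units.

```python
def conv_out(size, kernel, padding, stride):
    """Spatial output size, rounding down as most frameworks do."""
    return (size - kernel + 2 * padding) // stride + 1


def trace_shapes(height, width, channels, filters_per_block=(16, 32, 64)):
    """Return (layer_name, shape) pairs for the conv-conv-pool scheme."""
    shapes = [("input", (height, width, channels))]
    for block, num_filters in enumerate(filters_per_block, start=1):
        for conv in (1, 2):
            # Assumed 3x3 conv, padding 1, stride 1: spatial size unchanged.
            height = conv_out(height, 3, 1, 1)
            width = conv_out(width, 3, 1, 1)
            channels = num_filters  # one output channel per filter
            shapes.append((f"block{block}_conv{conv}", (height, width, channels)))
        # Assumed 2x2 max pool, stride 2: spatial size roughly halved.
        height = conv_out(height, 2, 0, 2)
        width = conv_out(width, 2, 0, 2)
        shapes.append((f"block{block}_pool", (height, width, channels)))
    shapes.append(("flatten", (height * width * channels,)))
    shapes.append(("fc_relu_dropout", (128,)))  # assumed hidden size
    shapes.append(("fc_softmax", (2,)))         # two classes
    return shapes


for name, shape in trace_shapes(500, 500, 3):
    print(f"{name:18s} {shape}")
```

With these assumed settings the flattened vector is still very large, which is one practical reason to downscale 500*500 inputs or add more pooling stages before the fully connected layers.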

Thank you for your answer. About point 1): 5 x 5 x 3 is the filter size for an image of size 32 x 32 x 3. Since the image has 3 channels, shouldn't there be 3 filters, one for each channel? Otherwise, how would the convolution take place for an RGB image? There should be a separate filter for each channel. How can K=12? This part is unclear to me; I would be grateful for a clarification. — Srishti M, Sep 16 '18 at 21:33
I'm confused about the 5x5x3 part. Is it a 3D CNN? Kernels are usually 2D (e.g. 5x5) that process each channel separately and sum the outputs... — MzdR, Sep 16 '18 at 22:19
The image is RGB with size 32x32x3, so many tutorials say that the filter has to account for the 3 channels. — Srishti M, Sep 17 '18 at 04:25
A typical 2D convolutional filter is applied to each map separately. If you read the Stanford tutorial you posted, it has a great numerical example showing how two feature maps ($K=2$) are computed from a 7x7x3 input image. — MzdR, Sep 17 '18 at 19:55
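One way to see how $K=12$ fits with a 3-channel image is the NumPy sketch below. It assumes the convention used in the cs231n notes linked in the question: each filter has shape 5x5x3, so a single filter covers all three channels, the per-channel products are summed, and the result is one feature map. The number of filters $K$ is then independent of the number of input channels. The function name and the random data are illustrative, and the bias term is left out for brevity.

```python
import numpy as np


def conv_layer(image, filters, stride=1):
    """Naive convolution without padding or bias.

    image:   (H, W, C) array
    filters: (K, F, F, C) array, i.e. K filters that each span all C channels
    returns: (H_out, W_out, K) array, one feature map per filter
    """
    height, width, channels = image.shape
    num_filters, size, _, filter_channels = filters.shape
    if channels != filter_channels:
        raise ValueError("filter depth must match the number of image channels")
    out_h = (height - size) // stride + 1
    out_w = (width - size) // stride + 1
    out = np.zeros((out_h, out_w, num_filters))
    for k in range(num_filters):
        for i in range(out_h):
            for j in range(out_w):
                r, c = i * stride, j * stride
                patch = image[r:r + size, c:c + size, :]
                # Multiply over all channels and sum to a single number.
                out[i, j, k] = np.sum(patch * filters[k])
    return out


rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32, 3))       # RGB image, 32x32x3
filters = rng.standard_normal((12, 5, 5, 3))   # K=12 filters of size 5x5x3
feature_maps = conv_layer(image, filters)
print(feature_maps.shape)  # (28, 28, 12)
```

The output depth of 12 comes from the number of filters, while the 3 only fixes the depth of each filter.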
