I have read a number of tutorials and online lectures (link 1: https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/ and link 2, the web tutorial: http://cs231n.github.io/convolutional-networks/#overview), but none of them explains the rationale for selecting a particular design. How do we decide on the design aspects below? The following points in particular are unclear to me, and I would be extremely grateful if somebody could walk through the concepts that would help me design a CNN architecture.
1) If the filter size is $5 \times 5 \times 3$, how come the number of filters is $K=12$? Shouldn't it be 3, one for each channel?
2) In the second link there is a formula, $((W-F+2P)/S)+1$,
where $W$ = width of the input image, $F$ = filter size, $P$ = amount of zero-padding, and $S$ = stride. It is not clear to me what the purpose of this formula is.
3) Is there a rule of thumb or formula for deciding on the number of layers, the number of filters, the filter size, and the number of fully connected layers? Or is it purely a matter of trial and error?
4) Can somebody please explain the intuition and the rationale for designing a CNN architecture for this example: a binary classification problem with an input RGB image of size $500 \times 500 \times 3$. How would you design the architecture, i.e. how many layers, how many filters, what filter size, what stride, etc.?
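To make point 2 concrete, here is how I would evaluate the formula in Python. This is only a minimal sketch of my current understanding: the function name `conv_output_size` is my own (hypothetical, not from either link), and the padding and stride values in the example are assumptions I picked for illustration, not a recommended design.

```python
def conv_output_size(w, f, p, s):
    """Evaluate ((W - F + 2P) / S) + 1 for one spatial dimension.

    w: input width, f: filter size, p: zero-padding, s: stride.
    """
    span = w - f + 2 * p
    # The formula only yields a whole number when the stride divides the span.
    if span < 0 or span % s != 0:
        raise ValueError("filter, padding and stride do not fit the input evenly")
    return span // s + 1


# 500-wide input (point 4) and a 5-wide filter (point 1).
# P = 0 and S = 1 are assumed values chosen only for this example.
print(conv_output_size(500, 5, 0, 1))  # 496
```

If this reading is right, the formula just gives the width of whatever comes out of a convolutional layer, but I would like confirmation of what it is meant to be used for when designing the network.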