18

Most networks I've seen have one or two dense layers before the final softmax layer.

  • Is there any principled way of choosing the number and size of the dense layers?
  • Are two dense layers more representative than one, for the same number of parameters?
  • Should dropout be applied before each dense layer, or just once?
Ethan
geometrikal
  • I am also interested in this topic. While @moh did answer some questions, one is left: is there any principled way of choosing the number and size of the dense layers? The link for reference [6] is broken. Any alternatives? – lostdatum Feb 18 '21 at 11:55

2 Answers

22

First of all:

There is no way to determine a good network topology just from the number of inputs and outputs. It depends critically on the number of training examples and the complexity of the classification you are trying to learn.[1]

And Yoshua Bengio has proposed a very simple rule:

Just keep adding layers until the test error does not improve anymore.[2]
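
A minimal sketch of that rule, assuming a generic Keras classifier on a synthetic non-linear task (the 64-unit width, the five-layer cap, and the toy dataset are placeholder choices, not recommendations):

```python
# Keep adding dense layers until validation accuracy stops improving.
import numpy as np
from tensorflow import keras

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20)).astype("float32")
y = (X[:, 0] * X[:, 1] > 0).astype("int32")  # toy non-linear target

def build(n_dense):
    # n_dense hidden Dense layers, then a sigmoid output
    model = keras.Sequential([keras.Input(shape=(20,))])
    for _ in range(n_dense):
        model.add(keras.layers.Dense(64, activation="relu"))
    model.add(keras.layers.Dense(1, activation="sigmoid"))
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

best = 0.0
for n in range(1, 6):
    hist = build(n).fit(X, y, validation_split=0.2, epochs=20, verbose=0)
    acc = max(hist.history["val_accuracy"])
    print(f"{n} dense layer(s): best val accuracy = {acc:.3f}")
    if acc <= best:   # no improvement: stop adding layers
        break
    best = acc
```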

Moreover:

The earlier features of a ConvNet contain more generic features (e.g. edge detectors or color blob detectors) that should be useful to many tasks, but later layers of the ConvNet become progressively more specific to the details of the classes contained in the original dataset.[3]

For example, in a method for learning feature detectors:

first layer learns edge detectors and subsequent layers learn more complex features, and higher level layers encode more abstract features. [4]

So, using two dense layers is generally preferable to one: the stacked layers can build progressively more abstract features on top of simpler ones.
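
This generic-to-specific progression is also why transfer learning [3] works: you can freeze a pretrained convolutional base and train only new dense layers on top. A rough sketch, assuming Keras and MobileNetV2 with ImageNet weights (the base network, input size, and head widths are illustrative choices):

```python
from tensorflow import keras

# Pretrained base: its early layers hold generic edge/blob detectors.
base = keras.applications.MobileNetV2(
    input_shape=(96, 96, 3), include_top=False,
    weights="imagenet", pooling="avg")
base.trainable = False  # keep the generic features fixed

# New dense head: these layers become specific to the new task.
model = keras.Sequential([
    base,
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),  # e.g. 10 classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```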

Finally:

The original paper on Dropout provides a number of useful heuristics to consider when using dropout in practice. One of them is: Use dropout on incoming (visible) as well as hidden units. Application of dropout at each layer of the network has shown good results. [5]

In a CNN, a Dropout layer is usually applied after each pooling layer, and also after your Dense layer, as sketched below. A good tutorial is here. [6]
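
For concreteness, here is a minimal sketch of that placement in Keras (the filter counts and the 0.25/0.5 rates are common defaults from the Keras examples, not prescriptions):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(32, 32, 3)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Dropout(0.25),   # after the first pooling layer
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Dropout(0.25),   # after the second pooling layer
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),    # after the dense layer
    layers.Dense(10, activation="softmax"),
])
model.summary()
```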

References:

[1] https://www.cs.cmu.edu/Groups/AI/util/html/faqs/ai/neural/faq.html

[2] Bengio, Yoshua. "Practical recommendations for gradient-based training of deep architectures." Neural networks: Tricks of the trade. Springer Berlin Heidelberg, 2012. 437-478.

[3] http://cs231n.github.io/transfer-learning/

[4] http://learning.eng.cam.ac.uk/pub/Public/Turner/Teaching/ml-lecture-3-slides.pdf

[5] https://machinelearningmastery.com/dropout-regularization-deep-learning-models-keras/

[6] https://cambridgespark.com/content/tutorials/convolutional-neural-networks-with-keras/index.html

Mo-
-1

One layer gives a linear approximation (you can use it if you need linear regression).

Two or more layers provide non-linearity... e.g., two layers give you, in the result, something like speed and acceleration.
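
A quick numerical check of this point, assuming Keras (see also the discussion in the comments below): stacking Dense layers without activation functions collapses to a single affine map; only a non-linear activation changes that.

```python
import numpy as np
from tensorflow import keras

x = np.linspace(-2, 2, 9).reshape(-1, 1).astype("float32")

# Two Dense layers, no activations: the composition is still affine.
affine = keras.Sequential([
    keras.Input(shape=(1,)),
    keras.layers.Dense(4),
    keras.layers.Dense(1),
])

# Same shape, but with a ReLU in between: genuinely non-linear.
nonlinear = keras.Sequential([
    keras.Input(shape=(1,)),
    keras.layers.Dense(4, activation="relu"),
    keras.layers.Dense(1),
])

# Second differences of an affine function are (numerically) zero.
print(np.diff(affine(x).numpy().ravel(), n=2))     # ~0 everywhere
print(np.diff(nonlinear(x).numpy().ravel(), n=2))  # typically non-zero at ReLU kinks
```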

JeeyCi
  • 1 layer gives non-linearity if you count the activation function - logistic regression is a dense layer + sigmoid. 2 layers do not make things faster; they make a more complex model. – Sean Owen May 29 '22 at 23:29
  • A non-linear activation function does NOT lead to non-linearity of the resulting approximation, because y = weight*x + bias is a linear function, and that is what is achieved by error minimization as a result of applying a Dense layer (through any activation function). The activation function just does the weighting, but the resulting function of the dependency (y = f(x)) is a linear function when 1 Dense layer is used. – JeeyCi May 30 '22 at 13:31
  • I'm not sure what you're trying to say, but an activation function is not a weight. In fact, composing 2 linear transformations just results in a linear transformation; the activation function is what makes this simple network setup non-linear. – Sean Owen Jun 01 '22 at 01:30
  • The activation function: "each neuron forms a weighted sum of its inputs" and then the AF works with this vector... But each W at each point is just the slope of the tangent (or speed). But using 2 Dense layers you get W^2 further on (meaning acceleration)... and only after applying the AF (depending on its nature) does it give linearity or non-linearity in the final result... -- this is what I meant – JeeyCi Jun 01 '22 at 06:23
  • BTW, the derivative values depend on the choice of AF (as does the slope for every 2nd-order polynomial, or the W^2 part in the intermediary input, as for any point's curvature)... and the vanishing gradient (dy/dx) problem also concerns the choice of AF -- but this is outside this question's concern – JeeyCi Jun 01 '22 at 06:29
  • Concerning "any principled way of choosing the number and size of the dense layers?" - I suppose, for the 1st question, it depends on the goals of the model: with more Dense layers you can get more precise results, but models with many layers suffer from vanishing gradients and overfitting. Experiment with different numbers of layers and compare ROC curves (better than precision for summarizing a classifier on an imbalanced dataset), and choose the best model that suits your goals. For the 2nd question: with too-large output layers you can get stuck at a local minimum (I didn't test this); 60% of the previous layer's size is recommended. – JeeyCi Jun 03 '22 at 10:35