I was looking at code and found this:
model.add(Dense(13, input_dim=13, kernel_initializer='normal', activation='relu'))
I was keen to know about kernel_initializer but wasn't able to understand its significance.
The neural network needs to start with some weights and then iteratively update them to better values. The term kernel_initializer is a fancy term for the statistical distribution or function used to initialise the weights. In the case of a statistical distribution, the library will generate numbers from that distribution and use them as the starting weights.
For example, in the code above a normal distribution will be used to initialise the weights. You can use other functions (constants like 1s or 0s) and distributions (e.g. uniform) too. All possible options are documented here.
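To make this concrete, here is a minimal sketch using the Keras Sequential API (the same style as the code in the question). The string shortcuts and the explicit RandomNormal initializer shown are built into Keras, though the exact default parameters behind the string aliases can differ between Keras versions:

```python
from keras.models import Sequential
from keras.layers import Dense
from keras.initializers import RandomNormal

model = Sequential()

# String shortcut: draw the starting weights from a normal distribution
model.add(Dense(13, input_dim=13, kernel_initializer='normal', activation='relu'))

# Equivalent idea, but with explicit control over the distribution's parameters
model.add(Dense(13, kernel_initializer=RandomNormal(mean=0.0, stddev=0.05), activation='relu'))

# Other built-in options: constants and a uniform distribution
model.add(Dense(13, kernel_initializer='zeros', activation='relu'))  # all weights start at 0
model.add(Dense(13, kernel_initializer='ones', activation='relu'))   # all weights start at 1
model.add(Dense(1, kernel_initializer='uniform'))                    # uniform distribution
```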
Additional explanation: The term kernel is a carryover from classical methods such as SVMs. The idea there is to transform data from a given input space into another space, where the transformation is achieved using kernel functions. We can think of neural network layers as non-linear maps performing similar transformations, hence the term kernel.
Whichever activation function you use, the initialization of the kernel weights is important for training the neural network. Below I have pasted the explanation about kernel initialization from the DeepLearning.AI page.
Initializing the kernel weights with inappropriate values will lead to divergence or a slow-down in the training of your neural network. Although the exploding/vanishing gradient problem is illustrated here with simple symmetric weight matrices, the observation generalizes to any initialization values that are too small or too large.
Case 1: A too-large initialization leads to exploding gradients
Consider the case where every kernel weight matrix is initialized slightly larger than the identity matrix, e.g. $$W = \begin{bmatrix}1.5 & 0 \\ 0 & 1.5\end{bmatrix}$$ When these activations are used in backward propagation, this leads to the exploding gradient problem. That is, the gradients of the cost with respect to the parameters are too big, causing the cost to oscillate around its minimum value.
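A toy NumPy sketch (not from the DeepLearning.AI page, just an illustration) shows why: repeatedly multiplying by a matrix slightly larger than the identity makes the activations, and therefore the gradients, grow exponentially with depth. Biases and non-linearities are ignored for simplicity:

```python
import numpy as np

W = np.array([[1.5, 0.0],
              [0.0, 1.5]])   # kernel slightly larger than the identity
a = np.array([1.0, 1.0])     # arbitrary input activation

for layer in range(10):
    a = W @ a                # a(l) = W a(l-1), ignoring biases and activations

print(a)                     # ~[57.7, 57.7]: values blow up exponentially with depth
```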
Case 2: A too-small initialization leads to vanishing gradients
Similarly, consider the case where every kernel weight matrix is initialized slightly smaller than the identity matrix, e.g. $$W = \begin{bmatrix}0.5 & 0 \\ 0 & 0.5\end{bmatrix}$$
When these activations are used in backward propagation, this leads to the vanishing gradient problem. The gradients of the cost with respect to the parameters are too small, causing the cost to converge before it has reached its minimum value.
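The same toy sketch with the scale changed from 1.5 to 0.5 shows the opposite effect, with the signal shrinking toward zero as it passes through the layers:

```python
import numpy as np

W = np.array([[0.5, 0.0],
              [0.0, 0.5]])   # kernel slightly smaller than the identity
a = np.array([1.0, 1.0])

for layer in range(10):
    a = W @ a

print(a)                     # ~[0.001, 0.001]: values shrink toward zero with depth
```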