
I've coded up my own neural network that I'm experimenting with. I'm curious about the implementation of L2 regularization I've seen in most literature.

Generally a weight update looks something like this:

$$w_i^{\prime} = w_i - \eta \frac{\partial C}{\partial w_i} - \frac{\eta \lambda}{n}w_i$$

That is, the new weight is the previous weight, minus a step in the gradient direction, minus the L2 regularization term, where $w_i$ are the weights, $C$ is the cost function, $\eta$ is the learning rate, $n$ is the dataset size, and $\lambda$ is the tuneable regularization parameter.
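
In code, I'm implementing that update roughly like this (a minimal NumPy sketch; the function and variable names are just mine):

```python
import numpy as np

def l2_update(w, grad_C, eta, lmbda, n):
    """One full-batch weight update with L2 regularization.

    w      -- weight array
    grad_C -- gradient of the (unregularized) cost C with respect to w
    eta    -- learning rate
    lmbda  -- regularization strength (lambda)
    n      -- dataset size
    """
    # w' = w - eta * dC/dw - (eta * lambda / n) * w
    return w - eta * grad_C - (eta * lmbda / n) * w

# Example: a single update on a 3x2 weight matrix
w = np.random.randn(3, 2)
grad = np.random.randn(3, 2)          # stand-in for dC/dw
w_new = l2_update(w, grad, eta=0.1, lmbda=5.0, n=50000)
```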

I feel uneasy about the way the L2 regularization term is calculated: $\frac{\eta \lambda}{n}w_i$

$\eta$ makes sense, and I'm comfortable with $\lambda$ as a tuneable parameter. But dividing by the dataset size $n$ concerns me. This means that the entire regularization effect is dependent on the input dataset. Why wouldn't I just define a simple hard coded percentage, such as: "reduce the weights by 0.1%"?
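
Concretely, the alternative I have in mind is a hypothetical sketch like this (the 0.1% `decay` value is just an example):

```python
def fixed_decay_update(w, grad_C, eta, decay=0.001):
    # Shrink the weights by a fixed 0.1% per update, with no dependence on n.
    return w - eta * grad_C - decay * w
```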

The dependency on the dataset size seems cumbersome, since $\lambda$ has to be adjusted whenever the dataset size changes. I'm often using smaller subsets of the main dataset for quick trial and error before training on the full dataset. I might also try continuously sampling from random perturbations of the original dataset, rendering $n$ rather ambiguous.

Incidentally I'm using minibatch gradient descent for training, with batch sizes typically from tens to thousands.
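
For reference, the minibatch form of the update I'm using looks roughly like this (my own sketch; the gradient is averaged over the batch of size $m$, while the decay term is still divided by the full dataset size $n$):

```python
import numpy as np

def minibatch_l2_update(w, batch_grads, eta, lmbda, n):
    """w           -- weight array
    batch_grads -- list of per-example gradients dC_x/dw for this minibatch
    eta, lmbda  -- learning rate and regularization strength
    n           -- full training-set size (not the minibatch size)
    """
    m = len(batch_grads)                 # minibatch size
    grad = sum(batch_grads) / m          # average gradient over the batch
    # Equivalent to: w - eta * grad - (eta * lmbda / n) * w
    return (1 - eta * lmbda / n) * w - eta * grad
```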

My question:

Is there a good reason to use $n$ as shown here, rather than subtracting off a small fixed percentage of the weights (i.e., eliminating $n$ from the regularization term)? What am I not thinking of here? Why is this common?

David Parks
    I think $ n $ is not batch size (or dataset size) but rather number of parameters in $ w_i $. In this way $ \lambda $ is stabilized among different choices of architecture. Can you show any source where $ n $ is denoted for sample size? – yasin.yazici Feb 22 '16 at 06:20
  • See, now that makes perfect sense to me, or treating lambda as a simple small percentage should do the same. But I'm going through this book: http://neuralnetworksanddeeplearning.com/chap3.html#handwriting_recognition_revisited_the_code, which I consider extremely well written. The regularization term is dependent on training set size. If I remember correctly (a debatable point) Andrew Ng's coursera class did the same. Which brought me to wonder if I was missing something important. – David Parks Feb 22 '16 at 18:53

0 Answers