I've coded up my own neural network that I'm experimenting with. I'm curious about the implementation of L2 regularization I've seen in most of the literature.
Generally a weight update looks something like this:
$$w_i^{\prime} = w_i - \eta \frac{\partial C}{\partial w_i} - \frac{\eta \lambda}{n}w_i$$
That is, the new weight is the previous weight, minus a step in the gradient direction, minus the L2 regularization term. Here $w$ are the weights, $C$ is the cost function, $\eta$ is the learning rate, $n$ is the dataset size, and $\lambda$ is the tuneable regularization parameter.
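For concreteness, this is roughly how I've implemented that update (a minimal NumPy sketch; `l2_update`, `grad`, and `lam` are just my own names, not from any particular framework):

```python
import numpy as np

# Sketch of the update rule above.
# w: weight vector, grad: dC/dw at the current step, eta: learning rate,
# lam: the regularization parameter lambda, n: dataset size.
def l2_update(w: np.ndarray, grad: np.ndarray, eta: float, lam: float, n: int) -> np.ndarray:
    return w - eta * grad - (eta * lam / n) * w
```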
I feel uneasy about the way the L2 regularization term is calculated: $\frac{\eta \lambda}{n}w_i$
$\eta$ makes sense, and I'm comfortable with $\lambda$ as a tuneable parameter. But dividing by the dataset size $n$ concerns me: it means the strength of the regularization depends on how large the input dataset is. Why wouldn't I just define a simple hard-coded percentage, such as "reduce the weights by 0.1%"?
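In code, the alternative I have in mind would look something like this (again just a sketch; `decay` is a made-up hard-coded factor, with 0.001 corresponding to "reduce the weights by 0.1%"):

```python
import numpy as np

# Alternative sketch: shrink the weights by a fixed fraction each step,
# with no dependence on the dataset size n.
def fixed_decay_update(w: np.ndarray, grad: np.ndarray, eta: float, decay: float = 0.001) -> np.ndarray:
    return w - eta * grad - decay * w
```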
The dependency on the dataset size seems cumbersome, since $\lambda$ has to be re-tuned whenever the dataset size changes. I'm often using smaller subsets of the main dataset for quick trial and error before training on the full dataset. I might also try continuously sampling from random perturbations of the original dataset, which makes $n$ rather ambiguous.
Incidentally I'm using minibatch gradient descent for training, with batch sizes typically from tens to thousands.
My question:
Is there a good reason to use $n$ as shown here rather than subtracting off a small fixed percentage of the weights, i.e. eliminating $n$ from the regularization term? What am I not thinking of here? Why is this common?