
Let's say I run an ordinary least squares regression with a Ridge penalty (i.e. Ridge regression) on 100,000 points randomly sampled from a huge dataset. The best regularization strength found is C=1.

What regularization strength can I roughly expect to be optimal if I run the same algorithm on 1,000,000 points from the same dataset?

Are there general rules that link the optimal regularization strength to the problem size? Do these rules rely on statistical assumptions, and how robust are they?

Thanks

mbl

1 Answer


In general, if you multiply the number of data points, $n$, by $s$ while leaving the number of predictors, $p$, unchanged, then $\|X\beta-y\|_{2}^{2}$ will increase by roughly a factor of $s$, while $\|\beta\|_{2}^{2}$ won't change much. To keep the same balance between the misfit term and the regularization term, you'll have to increase the regularization parameter by a factor of $s$ (or $\sqrt{s}$ if your regularization parameter is squared in the objective function).
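As a quick numerical check, here is a sketch of my own (on synthetic Gaussian data, with an arbitrary scale factor $s=10$ and noise level; none of this comes from the question) that fits plain OLS at two sample sizes and compares the two terms:

```python
# Sketch: with p fixed, multiplying n by s should scale the OLS misfit
# ||X beta_hat - y||_2^2 by roughly s, while ||beta_hat||_2^2 stays about the same.
import numpy as np

rng = np.random.default_rng(0)
p, n_small, s = 20, 10_000, 10          # predictors, base sample size, scale factor

def ols_terms(n):
    """Fit OLS on n synthetic points and return (misfit, penalty) terms."""
    X = rng.normal(size=(n, p))
    y = X @ np.ones(p) + rng.normal(scale=2.0, size=n)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((X @ beta_hat - y) ** 2), np.sum(beta_hat ** 2)

misfit_1, penalty_1 = ols_terms(n_small)
misfit_s, penalty_s = ols_terms(s * n_small)

print(f"misfit ratio  (expect ~{s}): {misfit_s / misfit_1:.2f}")
print(f"penalty ratio (expect ~1):   {penalty_s / penalty_1:.2f}")
```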

Although this is a good rule of thumb and a reasonable starting point for searching for a regularization parameter, you generally shouldn't just set the parameter by this rule of thumb. Rather, you should re-run whatever method you normally use to select the regularization parameter on the larger data set.
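For concreteness, here is one way that workflow could look with scikit-learn's `RidgeCV` (my own illustration on synthetic data, not code from this answer). Note that scikit-learn's Ridge penalizes `alpha * ||w||_2^2`, so `alpha` plays the role of the regularization parameter above and the heuristic predicts it grows roughly linearly with $n$:

```python
# Sketch: use the rule of thumb only to centre the alpha search grid,
# then re-select the parameter by cross-validation on the larger data set.
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(1)
p, n_small, n_large = 20, 10_000, 100_000
X = rng.normal(size=(n_large, p))
y = X @ rng.normal(size=p) + rng.normal(scale=3.0, size=n_large)

def cv_alpha(X, y, centre, span=3.0, n_grid=25):
    """5-fold CV over a log-spaced alpha grid centred on `centre`."""
    grid = np.logspace(np.log10(centre) - span, np.log10(centre) + span, n_grid)
    return RidgeCV(alphas=grid, cv=5).fit(X, y).alpha_

# 1) Select alpha on a small subsample.
idx = rng.choice(n_large, size=n_small, replace=False)
alpha_small = cv_alpha(X[idx], y[idx], centre=1.0)

# 2) Rule-of-thumb starting point for the full sample, then re-select by CV.
alpha_guess = alpha_small * (n_large / n_small)
alpha_large = cv_alpha(X, y, centre=alpha_guess)

print(f"alpha selected on {n_small:,} points:    {alpha_small:.3g}")
print(f"rule-of-thumb guess for {n_large:,}:     {alpha_guess:.3g}")
print(f"alpha re-selected on {n_large:,} points: {alpha_large:.3g}")
```

Whether the re-selected value actually ends up close to the rule-of-thumb guess depends on how much overfitting remains at the larger sample size, which is exactly the point raised in the comments below.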

  • Using more and more points, I can reach a situation where overfitting is negligible. In that situation the optimal C is small, and certainly does not scale linearly with the problem size. – mbl Oct 12 '15 at 14:51
  • Yes – my suggestion was that you don't just scale linearly with the number of data points. – Brian Borchers Oct 12 '15 at 15:17