
I recently came across the Matrix Factorization algorithm for a recommender system.

One of the tutorials I followed can be found here.

According to it, given the initial matrix $R$ and the goal of factoring it into two matrices $P$ and $Q$ with $K$ latent features, the error for each known entry is calculated using the following formula in order to avoid overfitting:

$$e_{ij}^2 = (r_{ij} - \sum_{k=1}^K{p_{ik}q_{kj}})^2 + \frac{\beta}{2} \sum_{k=1}^K{(||P||^2 + ||Q||^2)}$$
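For reference, tutorials of this kind typically fit $P$ and $Q$ by stochastic gradient descent on that objective. Below is a minimal sketch of such an update loop (my own paraphrase, not the tutorial's exact code; the learning rate `alpha` and the default values are hypothetical choices), showing where the $\beta$ term enters the update:

```python
import numpy as np

def factorize(R, K=2, steps=5000, alpha=0.002, beta=0.02):
    """SGD matrix factorization in the spirit of such tutorials.

    R holds the known ratings, with 0 marking unknown entries;
    alpha is a (hypothetical) learning rate and beta is the
    regularization strength from the formula above.
    """
    n_users, n_items = R.shape
    P = np.random.rand(n_users, K)
    Q = np.random.rand(K, n_items)
    for _ in range(steps):
        for i in range(n_users):
            for j in range(n_items):
                if R[i, j] > 0:  # only known entries contribute
                    e_ij = R[i, j] - P[i, :] @ Q[:, j]
                    # Gradient of the regularized per-entry error:
                    # the "- beta * factor" part pulls each factor toward zero.
                    grad_p = 2 * e_ij * Q[:, j] - beta * P[i, :]
                    grad_q = 2 * e_ij * P[i, :] - beta * Q[:, j]
                    P[i, :] += alpha * grad_p
                    Q[:, j] += alpha * grad_q
    return P, Q
```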

What I don't understand is:

  • Why add the $\frac{\beta}{2} \sum_{k=1}^K{(||P||^2 + ||Q||^2)}$ part to the squared error? It seems a bit arbitrary.

  • How does it prevent overfitting?

  • The basic idea is that constraining the solution to have smaller-magnitude weights gives us a dial that we can turn up or down to control the extent of any overfitting. See: https://stats.stackexchange.com/questions/188092/why-do-smaller-weights-result-in-simpler-models-in-regularization – Sycorax May 14 '22 at 02:25

1 Answer


The idea prevalent in recommender systems is to find a low-rank approximation of your matrix. The matrix factorization you describe gives rise to a low-rank approximation (at least when $K$ is relatively small, of course). From this paper (and references therein) you can learn about the different steps leading to this $||P||^2 + ||Q||^2$ regularization. It comes as an equivalent form of the nuclear norm, which is a convex surrogate of the rank (in fact its convex envelope on bounded matrices). To help you further, you can note this other reference, which gives a proof (Lemma 1).
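To make the equivalence concrete, here is a quick numpy sketch (just an illustration of the stated identity, not code from either paper): for the balanced factorization $P = U\Sigma^{1/2}$, $Q = \Sigma^{1/2}V^\top$ built from the SVD of a matrix $M$, the quantity $\tfrac{1}{2}(\|P\|_F^2 + \|Q\|_F^2)$ equals the nuclear norm $\|M\|_*$; any other factorization of $M$ can only give a larger value.

```python
import numpy as np

# Hypothetical small example matrix (illustration only).
rng = np.random.default_rng(0)
M = rng.standard_normal((6, 4)) @ rng.standard_normal((4, 5))  # rank <= 4

# Nuclear norm: sum of the singular values of M.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
nuclear_norm = s.sum()

# Balanced factorization M = P @ Q built from the SVD, for which
# 0.5 * (||P||_F^2 + ||Q||_F^2) attains the nuclear norm.
P = U * np.sqrt(s)               # U @ diag(sqrt(s))
Q = np.sqrt(s)[:, None] * Vt     # diag(sqrt(s)) @ Vt

print(np.allclose(P @ Q, M))     # True: it is a valid factorization of M
print(nuclear_norm)
print(0.5 * (np.linalg.norm(P, "fro") ** 2 + np.linalg.norm(Q, "fro") ** 2))
# The last two numbers coincide, illustrating the nuclear-norm connection.
```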

– Hugo B