
I recently came across the Matrix Factorization algorithm for a recommender system.

One of the tutorials I followed can be found here.

According to it, given the initial matrix $R$ and the goal of factoring it into two matrices $P$ and $Q$ with $K$ latent features, the error for each known entry is calculated using the following formula in order to avoid overfitting:

$$e_{ij}^2 = (r_{ij} - \sum_{k=1}^K{p_{ik}q_{kj}})^2 + \frac{\beta}{2} \sum_{k=1}^K{(||P||^2 + ||Q||^2)}$$
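For reference, tutorials of this kind typically fit $P$ and $Q$ by stochastic gradient descent on that objective. Below is a minimal sketch of such an update loop (my own paraphrase, not the tutorial's exact code; the learning rate `alpha` and the default values are hypothetical choices), showing where the $\beta$ term enters the update:

```python
import numpy as np

def factorize(R, K=2, steps=5000, alpha=0.002, beta=0.02):
    """SGD matrix factorization in the spirit of such tutorials.

    R holds the known ratings, with 0 marking unknown entries;
    alpha is a (hypothetical) learning rate and beta is the
    regularization strength from the formula above.
    """
    n_users, n_items = R.shape
    P = np.random.rand(n_users, K)
    Q = np.random.rand(K, n_items)
    for _ in range(steps):
        for i in range(n_users):
            for j in range(n_items):
                if R[i, j] > 0:  # only known entries contribute
                    e_ij = R[i, j] - P[i, :] @ Q[:, j]
                    # Gradient of the regularized per-entry error:
                    # the "- beta * factor" part pulls each factor toward zero.
                    grad_p = 2 * e_ij * Q[:, j] - beta * P[i, :]
                    grad_q = 2 * e_ij * P[i, :] - beta * Q[:, j]
                    P[i, :] += alpha * grad_p
                    Q[:, j] += alpha * grad_q
    return P, Q
```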

What I don't understand is:

  • Why add the $\frac{\beta}{2} \sum_{k=1}^K{(||P||^2 + ||Q||^2)}$ part to the squared error? It seems a bit arbitrary.

  • How does it prevent overfitting?

  • The basic idea is that constraining the solution to have smaller-magnitude weights gives us a dial that we can turn up or down to control the extent of any overfitting. See: https://stats.stackexchange.com/questions/188092/why-do-smaller-weights-result-in-simpler-models-in-regularization – Sycorax May 14 '22 at 02:25

1 Answer


The idea prevalent in recommender systems is to find a low-rank approximation of your matrix. The matrix factorization you describe gives rise to a low-rank approximation (at least when $K$ is relatively small, of course). From this paper (and references therein) you can learn about the different steps leading to this $||P||^2 + ||Q||^2$ regularization. It comes as an equivalent form of the nuclear norm, which is a convex surrogate of the rank (in fact its convex envelope on bounded matrices). To help you further, you can note this other reference, which gives a proof (Lemma 1).
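To make the equivalence concrete, here is a quick numpy sketch (just an illustration of the stated identity, not code from either paper): for the balanced factorization $P = U\Sigma^{1/2}$, $Q = \Sigma^{1/2}V^\top$ built from the SVD of a matrix $M$, the quantity $\tfrac{1}{2}(\|P\|_F^2 + \|Q\|_F^2)$ equals the nuclear norm $\|M\|_*$; any other factorization of $M$ can only give a larger value.

```python
import numpy as np

# Hypothetical small example matrix (illustration only).
rng = np.random.default_rng(0)
M = rng.standard_normal((6, 4)) @ rng.standard_normal((4, 5))  # rank <= 4

# Nuclear norm: sum of the singular values of M.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
nuclear_norm = s.sum()

# Balanced factorization M = P @ Q built from the SVD, for which
# 0.5 * (||P||_F^2 + ||Q||_F^2) attains the nuclear norm.
P = U * np.sqrt(s)               # U @ diag(sqrt(s))
Q = np.sqrt(s)[:, None] * Vt     # diag(sqrt(s)) @ Vt

print(np.allclose(P @ Q, M))     # True: it is a valid factorization of M
print(nuclear_norm)
print(0.5 * (np.linalg.norm(P, "fro") ** 2 + np.linalg.norm(Q, "fro") ** 2))
# The last two numbers coincide, illustrating the nuclear-norm connection.
```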

– Hugo B