
I'm testing an SVD-based collaborative filter on my data set, in which the label $r_{ij}$ is a real value from 0 to 1.

As many papers suggest, to get better performance, instead of using $ \hat{R} = U \cdot V^T $ directly, I use $\hat{R} = \mu + B_u + B_v + U \cdot V^T $, where $\mu$ is the average rating, $B_u$ is the user bias, and $B_v$ is the item bias.

Thus, this model corresponds to the loss function: $\min_{B_u, B_v, U, V} ||I\circ(R-\mu-B_u-B_v-U\cdot V^T)||_F^2 + \lambda (||B_u||_F^2 + ||B_v||_F^2 + ||U||_F^2 + ||V||_F^2)$

where $I$ is the masking matrix with $I_{ij} = 1$ if $R_{ij}$ is known (and 0 otherwise), and $||\cdot||_F$ is the Frobenius norm.

I then solve this by gradient descent; it seems to work fine, and the test RMSE is 0.25.
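For reference, a minimal full-batch gradient-descent solver for this loss might look like the sketch below. All names, sizes, and hyperparameters are illustrative (the data here is synthetic, not my actual data set):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 30, 40, 3

# Synthetic ratings in [0, 1], driven mostly by a bias structure.
true_bu = rng.normal(0.0, 0.1, n_users)
true_bv = rng.normal(0.0, 0.1, n_items)
R = np.clip(0.5 + true_bu[:, None] + true_bv[None, :]
            + 0.05 * rng.standard_normal((n_users, n_items)), 0.0, 1.0)
mask = rng.random((n_users, n_items)) < 0.5      # observed entries (the matrix I)

mu = R[mask].mean()                              # global mean over known ratings
bu = np.zeros(n_users)                           # user biases B_u
bv = np.zeros(n_items)                           # item biases B_v
U = 0.01 * rng.standard_normal((n_users, k))
V = 0.01 * rng.standard_normal((n_items, k))

lam, lr = 0.02, 0.02
for _ in range(2000):
    # Residuals on observed entries only: I ∘ (R - mu - B_u - B_v - U V^T)
    E = mask * (R - (mu + bu[:, None] + bv[None, :] + U @ V.T))
    bu += lr * (E.sum(axis=1) - lam * bu)
    bv += lr * (E.sum(axis=0) - lam * bv)
    # Tuple assignment so both updates use the pre-step U and V.
    U, V = U + lr * (E @ V - lam * U), V + lr * (E.T @ U - lam * V)

pred = mu + bu[:, None] + bv[None, :] + U @ V.T
rmse = np.sqrt(((R - pred)[mask] ** 2).mean())
print("train RMSE:", rmse)
print("mean |U V^T| entry:", np.abs(U @ V.T).mean())
```

Because this synthetic data was generated from biases alone, the latent term $U V^T$ stays small here too.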

However, when I investigate the contribution of each part of the prediction function $\hat{R} = \mu + B_u + B_v + U \cdot V^T $, I notice that $\mu$ is about 0.5 and the entries of $B_u$ and $B_v$ are about $\pm0.3$, but the entries of $ U \cdot V^T $ are quite small, normally about $\pm 0.01$.
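The kind of comparison I made can be sketched like this; the "fitted" parameters below are made up only to mirror the magnitudes I observed, not taken from my model:

```python
import numpy as np

# Hypothetical fitted parameters, chosen to mirror the magnitudes above;
# in practice these would come from the trained model.
rng = np.random.default_rng(1)
mu = 0.5
bu = rng.normal(0.0, 0.3, 100)                 # user biases B_u
bv = rng.normal(0.0, 0.3, 80)                  # item biases B_v
U = 0.01 * rng.standard_normal((100, 5))
V = 0.01 * rng.standard_normal((80, 5))

interaction = U @ V.T                          # the latent-factor term
print("mean |B_u| entry:  ", np.abs(bu).mean())
print("mean |B_v| entry:  ", np.abs(bv).mean())
print("mean |U V^T| entry:", np.abs(interaction).mean())
```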

Why does this part contribute so little? Since this is the part where the collaborative filtering actually works, I expected it to contribute more to the prediction.

ice_lin

2 Answers


I think that depends on the pattern of your data, in particular its skewness and sparsity, because those properties are absorbed by the two bias terms, which are designed to capture each user's and item's deviation from the global mean.

For example, you can compute each user's deviation from the global mean, $R_u -$ global_mean, and make a histogram of the per-user spread. Alternatively, make up a rating matrix $R'$ from the $\mu + B_u + B_v$ you've obtained, run the solver on it again, and see what happens. Each part plays its own role; even a change to global_mean would make a difference to the final RMSE. I think the latent factors are there to capture the remaining non-linear pattern in the user-item interaction that the biases cannot explain.
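Both suggestions can be sketched as follows; the rating matrix and mask here are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
R = np.clip(rng.normal(0.5, 0.2, (50, 60)), 0.0, 1.0)   # stand-in rating matrix
mask = rng.random(R.shape) < 0.6                         # observed entries

global_mean = R[mask].mean()

# 1) Per-user deviation from the global mean: if its spread is large,
#    the bias terms alone can already explain much of the variance.
user_dev = np.array([R[u][mask[u]].mean() - global_mean
                     for u in range(R.shape[0])])
print("std of per-user deviations:", user_dev.std())

# 2) A synthetic matrix built only from mu + B_u + B_v: re-running the
#    solver on R_prime should drive the latent part toward zero.
bu = user_dev
bv = np.array([R[:, i][mask[:, i]].mean() - global_mean
               for i in range(R.shape[1])])
R_prime = np.clip(global_mean + bu[:, None] + bv[None, :], 0.0, 1.0)
```

Feeding `R_prime` back into the solver gives a controlled check: any non-trivial $U V^T$ it learns there is fitting noise, not structure.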

So, as for your question:

Why does this part contribute so little?

Again, you could simply use $\hat{R} = UV^T$; in that case, I think it would contribute quite a lot, right? The contribution of each part is hard to pin down.

zihaolucky
  • thanks, but I'm still wondering: does the small contribution of $UV^T$ imply that, to fit my data, I don't actually need this non-linear part? – ice_lin Mar 28 '15 at 07:31
  • @ice_lin Hard to tell whether we need this part. If we just use the global average or hot items to recommend, we save a lot of computational resources. But I still want to say that, in industry, even a small change in prediction accuracy matters a lot! – zihaolucky Sep 24 '15 at 04:25

It means that each user's average rating across all items, combined with each item's average across all users, already predicts the ratings accurately -- you have an easy data set.
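One way to see this on made-up data (everything below is synthetic and only illustrates the point): when ratings are generated from biases plus noise, a bias-only predictor already reaches the noise floor, leaving $U V^T$ nothing to explain.

```python
import numpy as np

rng = np.random.default_rng(3)
n_u, n_i = 40, 50
bu = rng.normal(0.0, 0.15, n_u)                # true user biases
bv = rng.normal(0.0, 0.15, n_i)                # true item biases
R = np.clip(0.5 + bu[:, None] + bv[None, :]
            + 0.03 * rng.standard_normal((n_u, n_i)), 0.0, 1.0)

# Bias-only prediction from row and column averages:
# mu + (row_mean - mu) + (col_mean - mu) = row_mean + col_mean - mu.
mu = R.mean()
pred_bias_only = R.mean(axis=1)[:, None] + R.mean(axis=0)[None, :] - mu
rmse_bias_only = np.sqrt(((R - pred_bias_only) ** 2).mean())
print("bias-only RMSE:", rmse_bias_only)       # close to the noise level
```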

Emre