Cold starts in factorization - WALS projections

Question

I read here (Google's crash course on recommendations) the following:

Given a new item $i_0$ not seen in training, if the system has a few interactions with users, then the system can easily compute an embedding $v_{i_0}$ for this item without having to retrain the whole model. The system simply has to solve the following equation or the weighted version:

$\underset{v_{i_0} \in \mathbb{R}^d} {\min} \| A_{i_0} - Uv_{i_0}\| $

The preceding equation corresponds to one iteration in WALS: the user embeddings are kept fixed, and the system solves for the embedding of item $i_0$. The same can be done for a new user.

But if the item was truly not seen in training, how can the system still compute an embedding for it? If there is no data at all for this item, wouldn't the embedding be a 0 vector?

Also, they seem to cover this approach in the context of items not seen in training, but would it work for (new) users as well?

score 4 · Accepted Answer · answered May 26 '20 at 14:32

The key idea here is that you can avoid fully retraining the model as new items and users show up in the data. Instead, you can incrementally learn new vectors for them.

Imagine the following:

You've trained your recommender system on a big batch of data and deployed it into production. Later on, a new item $i_0$ is added that wasn't around when you did your initial training. They're describing a technique to learn an embedding for $i_0$ without having to retrain your entire model.

For this technique to work, some users must have interacted with item $i_0$; otherwise, as you mentioned, there's nothing to learn.

The same approach can work if a new user appears in the data. Again, the new user must interact with some items before you can learn a useful representation for them.

The big advantage here is computational efficiency: to compute the vector for $i_0$, you only need to look at the vectors for the users who interacted with it.
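As a minimal sketch of that fold-in step (not the course's own code), the NumPy snippet below solves the least-squares problem for one new item while keeping the user embeddings fixed. The function name `fold_in_item`, the ridge term `reg`, and the choice to use only the observed interactions (i.e. zero weight on unobserved entries) are assumptions made for illustration.

```python
import numpy as np

def fold_in_item(U, user_ids, ratings, reg=0.1):
    """Solve min_v ||a - U_obs v||^2 + reg * ||v||^2 for one new item.

    U        : (n_users, d) trained user embeddings, kept fixed
    user_ids : indices of the users who interacted with the new item
    ratings  : observed feedback from those users, same length as user_ids
    reg      : ridge penalty (an assumption; the quoted equation has none)
    """
    U_obs = U[user_ids]                      # only the interacting users matter
    d = U.shape[1]
    gram = U_obs.T @ U_obs + reg * np.eye(d)
    rhs = U_obs.T @ ratings
    return np.linalg.solve(gram, rhs)        # a d x d solve, independent of n_users

# Toy check with synthetic data: recover a known item vector from 20 interactions.
rng = np.random.default_rng(0)
U = rng.normal(size=(1000, 8))
true_v = rng.normal(size=8)
user_ids = rng.choice(1000, size=20, replace=False)
ratings = U[user_ids] @ true_v               # noiseless feedback, for illustration
v_new = fold_in_item(U, user_ids, ratings, reg=1e-8)
```

The cost depends on the number of interacting users and the embedding dimension, not on the size of the full matrix. For a new user, the same function applies with the roles swapped: pass the item embeddings in place of `U` and the indices of the items the user interacted with.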

In a real-world setting you might use this technique to keep the model up to date (doing this partial retraining, say, hourly or daily), and then do a full retrain once a week or so.

+1 Thanks Max! And why is this called a WLS projection? Separately, from what I gather and what you wrote above, couldn't that theoretically lead (over time) to an embedding that is suboptimal with respect to the original cost function, compared with retraining? — Josh, May 26 '20 at 17:25
For the second question, yes, this solution will likely do worse than retraining the entire model. The tradeoff here is between accuracy and training time. Retraining the whole model from scratch should lead to optimal performance, but may be slow and costly. That's why I mentioned the strategy of doing frequent partial re-training and occasional full re-training. — Max S., May 26 '20 at 21:38

Tim · Answer 2 · 2020-05-27T06:57:48.520

Max S. is correct; the answers to your question are also given in the quote:

if the system has a few interactions with users, then the system can easily compute an embedding $v_{i_0}$ for this item without having to retrain the whole model [...] same can be done for a new user

Adding to this, not only can you update the model more easily, by finding the embeddings only for the new item rather than for the full model, but you can also initialize the embeddings in a more clever way than with zeros or random values. Instead, you could start with a "prior" guess for the embeddings. For example, when you are recommending movies and a new Star Wars sequel enters movie theatres, you can make an educated guess that the embeddings for this movie would be similar to those of the other Star Wars parts, so you could start with something like the embedding of the previous part, or the average of their embeddings, etc., and then re-train when you have the data. If the movie is not a sequel, you can still do something similar, e.g. initialize the embeddings with something like the "average embedding for romantic comedies" and use it as an educated guess before you gather the data.
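One way to sketch this idea, under assumptions of my own: take the average embedding of related items as the prior, and once interactions arrive, solve a ridge problem that shrinks toward that prior instead of toward zero. The helper names `prior_embedding` and `fold_in_with_prior`, the penalty `reg`, and the prior-centered objective are illustrative choices, not something the course or the original answer prescribes.

```python
import numpy as np

def prior_embedding(V, similar_item_ids):
    """Educated guess: average the embeddings of related items (e.g. earlier sequels)."""
    return V[similar_item_ids].mean(axis=0)

def fold_in_with_prior(U, user_ids, ratings, v_prior, reg=1.0):
    """Solve min_v ||a - U_obs v||^2 + reg * ||v - v_prior||^2.

    With no interactions the solution is exactly v_prior; as interactions
    accumulate, the data term dominates and the prior matters less.
    """
    U_obs = U[user_ids]
    d = U.shape[1]
    gram = U_obs.T @ U_obs + reg * np.eye(d)
    rhs = U_obs.T @ ratings + reg * v_prior
    return np.linalg.solve(gram, rhs)

# Toy data: 50 existing items, 200 users, 4-dimensional embeddings.
rng = np.random.default_rng(1)
V = rng.normal(size=(50, 4))
U = rng.normal(size=(200, 4))

v_prior = prior_embedding(V, [0, 1, 2])      # hypothetical "previous parts"

# Before any interactions exist, the new item simply gets the prior.
no_users = np.array([], dtype=int)
v_cold = fold_in_with_prior(U, no_users, np.array([]), v_prior)

# After some interactions, the estimate moves toward what the data supports.
true_v = rng.normal(size=4)
user_ids = rng.choice(200, size=100, replace=False)
ratings = U[user_ids] @ true_v
v_warm = fold_in_with_prior(U, user_ids, ratings, v_prior)
```

The design choice here is that the prior acts as a regularization target rather than a starting point, because the least-squares fold-in has a closed-form solution and does not depend on initialization.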

You can also check another question, asked by me, and answered by Max S., on cold start in matrix factorization.

Thanks Tim +1. Leaving here the same questions I asked Max: Why is this called a WLS projection? And most importantly, couldn't that lead, theoretically and in principle, to an embedding that is suboptimal with respect to the original cost function, compared with retraining? Or is this approach mathematically equivalent to re-training with the full dataset? — Josh, May 26 '20 at 17:47
@Josh Max answered your second question. As for the first, the name is WALS, for Weighted Alternating Least Squares; see Koren et al (2009) for details. It is also covered in the course you are referring to: https://developers.google.com/machine-learning/recommendation/collaborative/matrix Btw, try not to ask multiple questions or follow-up questions in one post, but rather ask new questions. — Tim, May 26 '20 at 21:46
