I am exploring the possibility of using a deep autoencoder neural net to build a recommender system. I am first testing the model's performance on the traditional benchmark of the MovieLens data set. I have pivoted the data set into a matrix of size U x M, where U is the number of users and M is the number of movies. Each cell R_u,m is the rating (1-5) that user u gave to movie m, provided they have rated that movie; if not, the cell is populated with a 0. Since each user has rated only a very small proportion of all movies, the matrix is predominantly zeros.
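For concreteness, this is roughly how I build the matrix (assuming the standard `ratings.csv` with `userId`, `movieId`, and `rating` columns; the exact file and column names vary between MovieLens versions):

```python
import numpy as np
import pandas as pd

# Assumed file/column names; adjust for the MovieLens version in use
ratings = pd.read_csv("ratings.csv")

# Pivot into a U x M matrix; cells with no rating are filled with 0
R = ratings.pivot_table(index="userId", columns="movieId",
                        values="rating", fill_value=0)
R = R.to_numpy(dtype=np.float32)  # shape (U, M), predominantly zeros
```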
I am training the autoencoder by feeding the data set into it in batches of rows, encoding each row into a dimension k < M, then reconstructing it back to dimension M, and using the original input as the label in backprop - but only calculating the loss on the non-zero values of the input.
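As a sketch of what I mean (PyTorch here; the single hidden layer, k=64, batch size, and learning rate are placeholders rather than my actual settings):

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, m, k):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(m, k), nn.ReLU())
        self.decoder = nn.Linear(k, m)

    def forward(self, x):
        return self.decoder(self.encoder(x))

def masked_mse(pred, target):
    # Loss is computed only where a rating actually exists (non-zero cells)
    mask = target != 0
    return ((pred - target)[mask] ** 2).mean()

M = R.shape[1]                       # R is the U x M matrix from above
model = AutoEncoder(m=M, k=64)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loader = torch.utils.data.DataLoader(torch.from_numpy(R),
                                     batch_size=128, shuffle=True)

for batch in loader:                 # batches of user rows
    opt.zero_grad()
    loss = masked_mse(model(batch), batch)  # the input doubles as the label
    loss.backward()
    opt.step()
```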
The idea is that the model will reconstruct the original non-zero values accurately, and hence will populate all the original zero values with accurate predicted ratings.
It learns quite quickly to what seems to be an impressive loss value, but of course this could simply be overfitting. I now want to test on unseen data, but I am having difficulty figuring out how to divide the data set into training and testing sets.
Any recommendations on how to correctly train and test a model on this type of data?
What first came to mind is the following:
- All rows in the `original` matrix have at least 20 non-zero values.
- Set 10 randomly chosen of these non-zero values to 0; call this new matrix `input`.
- Feed this `input` matrix into the model for training, still calculating the loss only on the non-zero values.
- For testing, feed `original` into the model, but now calculate the loss only on the 10 values per row that were randomly set to 0 (see the code sketch below).
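In code, the split I have in mind would look something like this (`mask_ratings` is just an illustrative helper name, holding out exactly 10 observed ratings per row as described):

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_ratings(original, n_held_out=10):
    """Zero out n_held_out observed ratings per row; return the masked
    training matrix and a boolean mask marking the held-out cells."""
    inp = original.copy()
    held_out = np.zeros(original.shape, dtype=bool)
    for u in range(original.shape[0]):
        observed = np.flatnonzero(original[u])   # every row has >= 20
        hidden = rng.choice(observed, size=n_held_out, replace=False)
        inp[u, hidden] = 0
        held_out[u, hidden] = True
    return inp, held_out

inp, held_out = mask_ratings(R)  # R is the `original` matrix from above

# Train on `inp`, with the loss on its non-zero cells as before.
# Test by feeding `original` through the model and scoring only the
# held-out cells, e.g.:
#   test_loss = ((model_output - R)[held_out] ** 2).mean()
```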