I have a traditional prediction setting, with a training data set train and a test data set test.
I do not know the outcome y of the test set.
I found that t-SNE separates the two classes of my binary classification problem quite well.
However, t-SNE cannot really be used for prediction: it has no out-of-sample mapping, so there is nothing like predict(tsne, newdata=test), which works for PCA.
What is the best approach here?
Should I combine my train and test sets (i.e., rbind them) and run t-SNE on the whole data set?
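A minimal sketch of that combined approach, here in Python with scikit-learn (the R analogue would be rbind(train, test) followed by Rtsne; the array shapes and hyperparameters below are just illustrative assumptions). You fit t-SNE once on the stacked features, without any outcome, then split the embedding back into the train and test parts:

```python
import numpy as np
from sklearn.manifold import TSNE

# Toy stand-ins for the real feature matrices (no outcome y is used).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 10))
X_test = rng.normal(size=(30, 10))

# Analogue of rbind(train, test): stack features, embed jointly.
X_all = np.vstack([X_train, X_test])
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_all)

# Split the joint embedding back apart.
emb_train = emb[: len(X_train)]
emb_test = emb[len(X_train):]
```

Note the practical drawback: since t-SNE has no transform step, the whole embedding has to be recomputed whenever new test points arrive.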
Comments:

- [...] y for the training set? The usual approach is to split your training set, and measure your performance against labeled data. You then have some confidence regarding the quality of later predictions (when you do not have labels). What is the purpose of your testing? Is it to measure the performance of the model, or for something else? – Neil Slater Aug 21 '17 at 09:57
- [...] y for test. However, I do not want to use it. t-SNE is just an unsupervised dimension reduction algorithm that does not need the outcome at all. – spore234 Aug 21 '17 at 10:46
- [...] "test_y to train the algorithm" and "not knowing test_y". – Neil Slater Aug 21 '17 at 10:56