If I have a dataset with hundreds of samples and thousands of features, and t-SNE does a good job of separating the classes compared to other classifiers, I don't understand why I can't rerun the algorithm with an additional test sample and predict its class with KNN on the embedding. Even if the method is non-parametric, can't its robustness in producing coherent clusters time after time (with different seeds, or with some training samples omitted) be used as an argument for using it empirically?
- Because it isn't strictly a supervised learning algorithm per se; it is an embedding, i.e., a projection of high dimensions into a restricted space. – patagonicus Apr 12 '22 at 08:54
- OK, but what if in practice it makes a good classifier (as assessed by leave-one-out cross-validation)? – SebDL Apr 12 '22 at 09:37
- In practice the training data inherently has no labels, so supervised learning theory (PAC or similar) doesn't apply. – patagonicus Apr 12 '22 at 10:15
- Some related questions: (1) https://stats.stackexchange.com/questions/238538/are-there-cases-where-pca-is-more-suitable-than-t-sne/249520 (2) https://stats.stackexchange.com/questions/398734/how-can-t-sne-or-umap-embed-new-test-data-given-that-they-are-nonparametric – Sycorax Apr 12 '22 at 12:21
- Nothing is wrong! t-SNE actually works pretty well as pre-processing for classifiers like KNN, clustering algorithms like DBSCAN, or modeling algorithms like ANNs. – James LI Apr 12 '22 at 23:08
- What do you do for out-of-sample data? – Dave Apr 12 '22 at 23:21
- You can use a neural network to learn the t-SNE map, and then apply the network to new data (see the first sketch below). – James LI Apr 13 '22 at 02:21
- For out-of-sample data I merge it, each data point individually, with the training set and rerun the t-SNE, then look where it clusters: I apply k-NN on the 2D t-SNE embedding and predict the class of the out-of-sample data point (see the second sketch below). – SebDL Apr 26 '22 at 14:33
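A minimal sketch of the neural-network idea above, assuming the intent is to fit t-SNE once and then regress the 2-D coordinates on the raw features (in the spirit of parametric t-SNE); the data and the `MLPRegressor` settings are illustrative assumptions, not the commenter's actual setup:

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 1000))  # placeholder data: 300 samples, 1000 features

# Fit t-SNE once on the training set, then learn a parametric
# approximation of the map from raw features to 2-D coordinates.
emb = TSNE(n_components=2, random_state=0).fit_transform(X_train)
net = MLPRegressor(hidden_layer_sizes=(128, 32), max_iter=2000, random_state=0)
net.fit(X_train, emb)

# The regressor, unlike t-SNE itself, can embed unseen points.
X_new = rng.normal(size=(5, 1000))
emb_new = net.predict(X_new)  # approximate coordinates for new data
```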
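And a sketch of the merge-and-rerun procedure from the last comment; `predict_by_refit` is a hypothetical helper name. Note that t-SNE is refit once per test point, so the coordinates change between reruns and a leave-one-out evaluation costs one full fit per sample:

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.neighbors import KNeighborsClassifier

def predict_by_refit(X_train, y_train, x_new, n_neighbors=5, seed=0):
    """Hypothetical helper: embed the training set plus one test point
    jointly with t-SNE, then classify the test point by k-NN in 2-D."""
    X_joint = np.vstack([X_train, x_new[None, :]])
    emb = TSNE(n_components=2, random_state=seed).fit_transform(X_joint)
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn.fit(emb[:-1], y_train)       # training points in embedding space
    return knn.predict(emb[-1:])[0]  # predicted label of the appended point
```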
1 Answer
t-SNE gives no function for embedding out-of-sample data in the low-dimensional space. Consequently, all of the usual machine learning notions about out-of-sample performance are out.
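To see this concretely (assuming scikit-learn's implementation, which to my knowledge exposes `fit_transform` but no `transform`):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 1000))  # placeholder training data

# Fitting yields coordinates for these rows only; there is no
# method to project unseen rows into the same 2-D space.
emb = TSNE(n_components=2, random_state=0).fit_transform(X_train)
print(hasattr(TSNE, "transform"))  # False
```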
If you use a different dimension-reduction approach that does provide such a map, such as UMAP or PCA, and then fit a model on the reduced representation, that's fine.
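A minimal sketch of that route (PCA here; umap-learn's UMAP offers the same fit/transform interface; the data, labels, and hyperparameters are placeholder assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2000))  # placeholder: 300 samples, 2000 features
y = rng.integers(0, 3, size=300)  # placeholder labels, 3 classes

# PCA has a transform() for unseen data, so the whole pipeline can be
# fit on training folds and applied cleanly to held-out folds.
clf = make_pipeline(PCA(n_components=50), KNeighborsClassifier(n_neighbors=5))
print(cross_val_score(clf, X, y, cv=5).mean())
```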
To address your exact question about KNN: if you use KNN after a dimension reduction, you're still using KNN, a supervised classifier, so the overall procedure is not an unsupervised method.
Dave