
These data are from SAMHSA's Mental Health Client-Level Data. I'm trying to find the right parameters to obtain clustering as in this paper. Code here.

For now, I'm dropping columns that aren't disorders. I expect life factors (employment, insurance, housing status, etc.) to be correlated with the disorders.

I'm using this tool to understand t-SNE better, and I'm playing with the perplexity.

num_data_points = 500,000 (of 6.5 million total), num_cols = 50, num_iters = 500, perplexity = 50

t-SNE plot of mental health disorders given life factors
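For reference, a run with the parameters above can be sketched with scikit-learn's `TSNE`. This is a minimal sketch on synthetic binary data standing in for the 500,000 × 50 disorder matrix; the variable names are placeholders, not the actual code:

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in for the 500,000 x 50 binary disorder matrix;
# the real input comes from the SAMHSA client-level file.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 50)).astype(float)

# perplexity=50 as above; the run also used 500 iterations
# (the n_iter parameter, renamed max_iter in newer scikit-learn releases).
tsne = TSNE(n_components=2, perplexity=50, init="pca", random_state=0)
embedding = tsne.fit_transform(X)
print(embedding.shape)  # (300, 2)
```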

EDIT 1

The first six eigenvalues of $X_c^T X_c / n$ where $X_c = X - \mu$ and $n$ is the number of samples are:

0.674496730896796
0.6745034759314714
0.4741462086560067
0.47415095016150915
0.4124591388160532
0.4124632633990249

PCA plot of mental health disorders
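The eigenvalues above can be cross-checked against scikit-learn's `PCA`. One caveat: `explained_variance_` uses the unbiased $n-1$ normalization rather than the $n$ used in $X_c^T X_c / n$, so a rescaling is needed. A sketch on synthetic stand-in data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
n = X.shape[0]

# Eigenvalues of the covariance matrix X_c^T X_c / n, as in the edit above.
Xc = X - X.mean(axis=0)
eigvals = np.linalg.eigvalsh(Xc.T @ Xc / n)[::-1]  # sorted descending

# sklearn's explained_variance_ divides by n - 1, so rescale to compare.
pca = PCA().fit(X)
assert np.allclose(eigvals, pca.explained_variance_ * (n - 1) / n)

# Fraction of total variance captured by the first two components --
# the "sum of the scores" asked about in the comments.
print(pca.explained_variance_ratio_[:2].sum())
```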

Is there any trick to reveal more clustering? Would k-means be better?

EDIT 2

The k-means cluster labeling is very revealing. num_clusters=13 for now:

t-SNE plot colored by k-means cluster labels
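The workflow above, k-means run on the full high-dimensional data and t-SNE used only to produce 2-d coordinates for display, can be sketched as follows (synthetic stand-in data; the real run used the full dataset and num_clusters=13):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

# Synthetic stand-in for the high-dimensional binary feature matrix.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(400, 15)).astype(float)

# Cluster in the original high-dimensional space, NOT on the 2-d embedding.
labels = KMeans(n_clusters=13, n_init=10, random_state=0).fit_predict(X)

# t-SNE is then used only to produce 2-d coordinates for display.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

# With matplotlib, each embedded point is colored by its k-means label:
# plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap="tab20", s=2)
print(embedding.shape, labels.shape)
```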


pred_cols = ['AGE','EDUC','GENDER','SPHSERVICE','CMPSERVICE','OPISERVICE','RTCSERVICE','IJSSERVICE','SAP','VETERAN',
             'ETHNIC','RACE','MARSTAT','EMPLOY','LIVARAG']  # the last five are one-hot encoded

disorder_cols = ['no_disorder','DELIRDEMFLG','CONDUCTFLG','ADHDFLG','DEPRESSFLG','BIPOLARFLG','PERSONFLG','ALCSUBFLG','TRAUSTREFLG','ANXIETYFLG','SCHIZOFLG','OTHERDISFLG','ODDFLG','PDDFLG','multi-disorder']
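The columns marked for one-hot encoding can be expanded with pandas' `get_dummies`. A minimal sketch, where the DataFrame and its category codes are made up for illustration:

```python
import pandas as pd

# Tiny stand-in for the SAMHSA table; the real data has millions of rows.
df = pd.DataFrame({
    "AGE": [4, 7, 7],
    "ETHNIC": [1, 2, 1],
    "RACE": [5, 3, 5],
    "MARSTAT": [1, 1, 2],
    "EMPLOY": [3, 1, 1],
    "LIVARAG": [2, 2, 1],
})

# One-hot encode only the categorical code columns flagged above;
# numeric columns like AGE pass through unchanged.
onehot_cols = ["ETHNIC", "RACE", "MARSTAT", "EMPLOY", "LIVARAG"]
encoded = pd.get_dummies(df, columns=onehot_cols)
print(sorted(encoded.columns))
```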

  • Did you try k-means clustering? Or pca + k-means clustering? – Stef Mar 04 '24 at 16:10
  • @Stef I did try PCA, 2-components. The result wasn’t terribly enlightening but I’ll include a plot. I haven’t done k-means yet. – Jackson Walters Mar 04 '24 at 18:54
  • When you run PCA you get a score for each component; the sum of the scores add up to 100%. What was the sum of the scores of the first 2 components? Try k-means, with and without pca. – Stef Mar 04 '24 at 19:23
  • @Stef There is a relationship between SVD and PCA: https://stats.stackexchange.com/questions/134282/relationship-between-svd-and-pca-how-to-use-svd-to-perform-pca. The first two eigenvalues are both ~0.6745. I'll try k-means next. – Jackson Walters Mar 04 '24 at 20:30
  • 67% is not very high. You should try keeping more components. – Stef Mar 04 '24 at 20:47
  • @Stef What do you mean by keeping components? I'm making 2d plots, so those only use the first two components anyways, right? I'm getting some really neat results with k-means labeling for the t-SNE plot (updated above). – Jackson Walters Mar 04 '24 at 21:04
  • Clustering should use the fully informative data and not limit itself to the displayed data. High dimensional data cannot be faithfully represented in low dimensions. The display dimensions should have no bearing on the issue of how many components to use. – micans Mar 07 '24 at 13:44
  • @micans The t-SNE and k-means (the displayed data in the 2d plot) are using all the data. I don't understand what you mean. PCA writes the vectors in a new basis with decreasing singular values. The PCA and t-SNE are completely separate - I am not feeding in PCA data into t-SNE. If I keep, say, 3 PCA components, my plot would need to be 3d. – Jackson Walters Mar 07 '24 at 14:11
  • You can cluster using 3 or more PCA components and display the data in 2d. t-SNE is just a display tool. It is possible to use the labels of any type of clustering algorithm and use those labels to visualise the clustering in the t-SNE plot. The two (display versus clustering) solve different problems. – micans Mar 07 '24 at 14:18
  • @micans I'm still confused. If I have 3 or more PCA components, and I apply a clustering algorithm such as k-means, I just get labels, so I have high-dimensional labeled data. Which map are you suggesting I use to project this data to 2d? Further, t-SNE is not just a display tool, it is such a map from high dim'l spaces to 2d. It works by putting Gaussian distributions on the points and minimizing the KL-divergence between the distribution in the small space and big space. I'm aware you can use labeling of any type - I have chosen k-means. – Jackson Walters Mar 07 '24 at 14:37
  • You can project the high-dimensional data using whichever way you like (PCA, UMAP, t-SNE), and then colour the data with labels you obtained in any other way you like (some form of clustering presumably). I would consider "a map from high-dimensional spaces to 2d" a visualisation tool, but it does not really matter. What does matter is that the projection onto the 2d space and the clustering should be, I strongly feel, independent processes both acting on the high-dimensional data or a good approximation of it - clustering should almost never be done on a 2-dimensional reduction. – micans Mar 07 '24 at 16:15
  • @micans I really think you don't understand what I did, because I did exactly what you're describing - the labeling was k-means done on the full, high dimensional dataset, then t-SNE was used to give a 2d embedding, then the pre-determined labeled were appended to the t-SNE, then it was plotted. Perhaps take a look at the code. – Jackson Walters Mar 07 '24 at 17:21
  • I responded to (a) "What do you mean by keeping components? I'm making 2d plots, so those only use the first two components anyways, right?", which in turn was in response to (b) "67% is not very high. You should try keeping more components." (b) was about clustering, to which you responded with (a) a remark about 2d plots, so I thought it worthwhile to make sure the two aspects were not conflated. – micans Mar 07 '24 at 17:39
  • @micans That's fine, but for "keeping more components", yes, there are many PCA components. One cannot visualize them directly (https://plotly.com/python/pca-visualization/). Are you suggesting labeling using the PCA components? I suppose that would amount to a basis change, then doing k-means vs. what I'm doing now, labeling using k-means in the standard basis (before doing either PCA or t-SNE). – Jackson Walters Mar 07 '24 at 17:48
  • Sorry, our exchanges were a mix of clarification and confusion. I just wanted to make sure that clustering and display were clearly separated. People have sometimes used the 2d representation to cluster, and (a) mentioned earlier made me worry about that. If that's not the case, all good. I have no specific recommendations for clustering. – micans Mar 07 '24 at 18:40
  • @micans Also sorry, yes a mixture of clarification and confusion. I think we're on the same page - labeling before mapping to 2d, NOT clustering on the 2d data. In this case, the k-means just did a really good job labeling the clusters, I think. – Jackson Walters Mar 07 '24 at 19:02
  • Beware of the issues of t-SNE: https://stats.stackexchange.com/questions/263539/clustering-on-the-output-of-t-sne – Erich Schubert Mar 14 '24 at 10:51
  • @ErichSchubert I did the clustering before t-SNE, not after. – Jackson Walters Mar 14 '24 at 16:10

0 Answers