These data are from SAMHSA, Mental Health Client-Level Data. I am trying to find the right parameters to obtain clustering as in this paper. Code here.
For now, I'm dropping columns which aren't disorders. I expect life factors (employment, insurance, housing status, etc.) to be correlated.
I'm using this tool to understand t-SNE better. I'm playing with the perplexity.
Current t-SNE settings: num_data_points = 500,000 (of 6.5 million total), num_cols = 50, num_iters = 500, perplexity = 50
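For reference, here is a minimal sketch of the kind of perplexity sweep described above, assuming scikit-learn's `TSNE` is the implementation in use (the post does not say which one). The helper name `tsne_embeddings`, the subsample size, and the synthetic 0/1 matrix are all hypothetical stand-ins, not the SAMHSA data. The iteration count is left at the library default because its keyword differs between scikit-learn versions (`n_iter` in older releases, `max_iter` in newer ones).

```python
import numpy as np
from sklearn.manifold import TSNE


def tsne_embeddings(X, perplexities=(5, 30, 50), sample_size=2000, seed=0):
    """Embed a random subsample of X in 2-D, once per perplexity value.

    Returns the row indices of the subsample and a dict
    {perplexity: array of shape (n_sub, 2)}.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Subsample without replacement so each run stays cheap
    idx = rng.choice(n, size=min(sample_size, n), replace=False)
    X_sub = X[idx]

    embeddings = {}
    for p in perplexities:
        # scikit-learn requires perplexity < number of samples
        if p >= X_sub.shape[0]:
            continue
        tsne = TSNE(n_components=2, perplexity=p, init="pca", random_state=seed)
        embeddings[p] = tsne.fit_transform(X_sub)
    return idx, embeddings


# Synthetic stand-in for binary disorder flags (NOT the real data)
rng = np.random.default_rng(0)
X_demo = (rng.random((300, 12)) < 0.2).astype(float)
idx, emb = tsne_embeddings(X_demo, perplexities=(5, 30), sample_size=300)
```

Plotting each entry of `emb` side by side makes it easier to see whether apparent clusters are stable across perplexities or only show up at one setting.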
EDIT 1
The first six eigenvalues of $X_c^T X_c / n$, where $X_c = X - \mu$ is the centered data matrix and $n$ is the number of samples, are:
0.674496730896796
0.6745034759314714
0.4741462086560067
0.47415095016150915
0.4124591388160532
0.4124632633990249
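The eigenvalue computation defined above can be sketched as follows. This is a NumPy-only illustration, assuming $\mu$ is the vector of column means; the function name `covariance_eigenvalues` is hypothetical. It also returns each eigenvalue's share of the total variance, which helps judge how much structure the leading linear directions capture.

```python
import numpy as np


def covariance_eigenvalues(X, k=6):
    """Top-k eigenvalues of X_c^T X_c / n and their share of total variance."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    X_c = X - X.mean(axis=0)           # subtract column means (mu)
    cov = X_c.T @ X_c / n              # d x d covariance with 1/n scaling
    eigvals = np.linalg.eigvalsh(cov)  # symmetric matrix, ascending order
    eigvals = eigvals[::-1]            # largest first
    share = eigvals / eigvals.sum()
    return eigvals[:k], share[:k]


# Tiny example: column 0 has variance 1, column 1 is constant
X_small = np.array([[0.0, 5.0], [2.0, 5.0], [0.0, 5.0], [2.0, 5.0]])
top, share = covariance_eigenvalues(X_small, k=2)
```

If the shares decay slowly, as evenly spread eigenvalues would suggest, a linear projection is unlikely to separate groups cleanly on its own.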
Is there any trick to reveal more clustering? Would k-means be better?
EDIT 2
The k-means cluster labeling is very revealing. num_clusters=13 for now:
pred_cols = ['AGE','EDUC','GENDER','SPHSERVICE','CMPSERVICE','OPISERVICE','RTCSERVICE','IJSSERVICE','SAP','VETERAN','ETHNIC','RACE','MARSTAT','EMPLOY','LIVARAG']  # ETHNIC, RACE, MARSTAT, EMPLOY and LIVARAG are one-hot encoded
disorder_cols = ['no_disorder','DELIRDEMFLG','CONDUCTFLG','ADHDFLG','DEPRESSFLG','BIPOLARFLG','PERSONFLG','ALCSUBFLG','TRAUSTREFLG','ANXIETYFLG','SCHIZOFLG','OTHERDISFLG','ODDFLG','PDDFLG','multi-disorder']
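One way to read the k-means labels is to cluster on one feature matrix and then look at how often each disorder flag occurs within each cluster. The sketch below assumes scikit-learn's `KMeans` and assumes the clustering is run on the predictor columns; the post does not state either, and the helper name `cluster_profiles` and the synthetic blobs are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans


def cluster_profiles(X_features, X_flags, n_clusters=13, seed=0):
    """Cluster rows on X_features, then summarise X_flags per cluster.

    Returns labels, cluster sizes, and an (n_clusters, n_flags) array
    holding the mean of each flag within each cluster.
    """
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = km.fit_predict(X_features)
    sizes = np.bincount(labels, minlength=n_clusters)
    # Mean of a 0/1 flag within a cluster = prevalence of that flag there.
    # A cluster with no members would yield NaN in its row.
    profiles = np.vstack(
        [X_flags[labels == c].mean(axis=0) for c in range(n_clusters)]
    )
    return labels, sizes, profiles


# Synthetic demo: three well-separated blobs, flags = one-hot blob identity
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
true_group = np.repeat(np.arange(3), 50)
X_feat = centers[true_group] + rng.normal(scale=0.5, size=(150, 2))
X_flag = np.eye(3)[true_group]
labels, sizes, profiles = cluster_profiles(X_feat, X_flag, n_clusters=3)
```

With the real data, `X_features` would be the matrix built from `pred_cols` and `X_flags` the matrix built from `disorder_cols`; rows of `profiles` that differ sharply from the overall flag prevalence are the clusters worth inspecting.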


