I'm looking at the SAMHSA Mental Health Client-Level dataset. I made some t-SNE plots of 500k rows out of 6.5 million, after dropping irrelevant columns, normalizing some, and one-hot encoding others.
I'm trying to do classification: predicting the diagnosis from the other columns. I trained a RandomForestClassifier, which reaches 36% accuracy when the diagnoses are lumped into [no_disorder (1), unique_disorder (13), multi-disorder (1)], for 15 total categories. Accuracy drops to 17% when every combination of diagnosis flags is treated as its own category (2^13 = 8192 possible categories).
Is this the best classification algorithm to use? Would k-means be better?
I'm not quite sure how to interpret this confusion matrix. Can someone help out? My guess is that the remaining columns (life factors) are simply not enough to predict the diagnosis, especially given how messy and subjective the diagnoses are; i.e., I need symptom-level data.
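For what it's worth, here is a minimal sketch of how I'd try to read the matrix, assuming rows are true labels and columns are predicted labels (the scikit-learn convention). The matrix `cm` below is made-up toy data, not my actual results, and the helper names are hypothetical. The idea is to compare overall accuracy against the "always predict the biggest class" baseline and to look at per-class recall (the row-normalized diagonal).

```python
def accuracy(cm):
    """Fraction of all samples that sit on the diagonal."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / sum(sum(row) for row in cm)

def majority_baseline(cm):
    """Accuracy of always predicting the most common true class."""
    totals = [sum(row) for row in cm]
    return max(totals) / sum(totals)

def per_class_recall(cm):
    """Diagonal of the row-normalized matrix: share of each true class predicted correctly."""
    recalls = []
    for i, row in enumerate(cm):
        total = sum(row)
        recalls.append(row[i] / total if total else 0.0)
    return recalls

# Toy 3-class confusion matrix (illustrative numbers only).
cm = [
    [50, 5, 5],   # true class 0
    [20, 8, 2],   # true class 1
    [8, 1, 1],    # true class 2
]

print("accuracy:", accuracy(cm))
print("majority baseline:", majority_baseline(cm))
print("per-class recall:", per_class_recall(cm))
```

If accuracy is close to (or below) the majority baseline and recall is high for only one or two large classes, the model is mostly guessing the dominant diagnosis, which would be consistent with the features carrying little signal.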
pred_cols = ['AGE','EDUC','GENDER','SPHSERVICE','CMPSERVICE','OPISERVICE','RTCSERVICE','IJSSERVICE','SAP','VETERAN','ETHNIC','RACE','MARSTAT','EMPLOY','LIVARAG']  # ETHNIC, RACE, MARSTAT, EMPLOY, LIVARAG are one-hot encoded
disorder_cols = ['no_disorder','DELIRDEMFLG','CONDUCTFLG','ADHDFLG','DEPRESSFLG','BIPOLARFLG','PERSONFLG','ALCSUBFLG','TRAUSTREFLG','ANXIETYFLG','SCHIZOFLG','OTHERDISFLG','ODDFLG','PDDFLG','multi-disorder']
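To make the 15-category lumping concrete, here is a rough sketch of the mapping from the 13 diagnosis flags to a single label. It assumes each flag is coded 0/1 per row (an assumption about the encoding), and `collapse_label` is a hypothetical helper name, not something from a library.

```python
# The 13 binary diagnosis flags (no_disorder and multi-disorder are derived labels).
FLAG_COLS = ['DELIRDEMFLG', 'CONDUCTFLG', 'ADHDFLG', 'DEPRESSFLG', 'BIPOLARFLG',
             'PERSONFLG', 'ALCSUBFLG', 'TRAUSTREFLG', 'ANXIETYFLG', 'SCHIZOFLG',
             'OTHERDISFLG', 'ODDFLG', 'PDDFLG']

def collapse_label(row):
    """Map a dict of 0/1 flags to one of 15 class names.

    Assumes a missing flag means 0 (an assumption for this sketch).
    """
    active = [col for col in FLAG_COLS if row.get(col, 0) == 1]
    if not active:
        return 'no_disorder'
    if len(active) == 1:
        return active[0]          # exactly one diagnosis: use its flag name
    return 'multi-disorder'       # two or more diagnoses lumped together

# Toy rows for illustration only.
rows = [
    {'DEPRESSFLG': 1},
    {'DEPRESSFLG': 1, 'ANXIETYFLG': 1},
    {},
]
labels = [collapse_label(r) for r in rows]
print(labels)
```

Note that 'multi-disorder' merges very different combinations into one class, which may blur whatever signal the features have.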
EDIT: k-means with num_clusters=13 is very revealing on the t-SNE plot.

