I'm looking at the SAMHSA Mental Health Client-Level dataset. I made some t-SNE plots of 500k rows out of 6.5 million, after dropping irrelevant columns, normalizing some, and one-hot encoding others.

I'm trying to do classification, predicting the diagnosis from the other columns. I trained a RandomForestClassifier, which is 36% accurate when the diagnoses are lumped into [no_disorder (1), unique_disorder (13), multi-disorder (1)], for 15 total categories. Accuracy drops to 17% when every combination of diagnoses is treated as its own class (2^13 = 8192 categories).

Is this the best classification algorithm to use? Would k-means be better?

I'm not quite sure how to interpret this confusion matrix. Can someone help out? My guess is that the remaining columns (life factors) are just not enough to predict the diagnosis, combined with messy, subjective diagnoses, i.e. I need symptom-level data.

[Image: multi-class confusion matrix]
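As a rough aid for reading a matrix like this, per-class recall and precision are often more informative than overall accuracy. The sketch below assumes the usual convention that rows are true classes and columns are predicted classes; the helper name and the 3-class counts are made up for illustration and are not taken from the SAMHSA data.

```python
def per_class_recall_precision(cm):
    """cm[i][j] = number of rows with true class i that were predicted as class j."""
    n = len(cm)
    recall, precision = [], []
    for k in range(n):
        row_total = sum(cm[k])                       # all rows whose true class is k
        col_total = sum(cm[i][k] for i in range(n))  # all rows predicted as class k
        recall.append(cm[k][k] / row_total if row_total else 0.0)
        precision.append(cm[k][k] / col_total if col_total else 0.0)
    return recall, precision

# Made-up 3-class counts, purely illustrative.
cm = [[50, 10, 40],
      [5, 20, 75],
      [10, 10, 180]]

recall, precision = per_class_recall_precision(cm)
accuracy = sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))

for k, (r, p) in enumerate(zip(recall, precision)):
    print(f"class {k}: recall={r:.2f} precision={p:.2f}")
print(f"overall accuracy={accuracy:.3f}")
```

In this toy example most of the mass sits in the last column, so the large class has high recall while the smaller classes are mostly absorbed into it. A heavy column in a real confusion matrix can be read the same way.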


pred_cols = ['AGE','EDUC','GENDER','SPHSERVICE','CMPSERVICE','OPISERVICE','RTCSERVICE','IJSSERVICE','SAP','VETERAN','ETHNIC','RACE','MARSTAT','EMPLOY','LIVARAG']  # ETHNIC, RACE, MARSTAT, EMPLOY, LIVARAG are one-hot encoded

disorder_cols = ['no_disorder','DELIRDEMFLG','CONDUCTFLG','ADHDFLG','DEPRESSFLG','BIPOLARFLG','PERSONFLG','ALCSUBFLG','TRAUSTREFLG','ANXIETYFLG','SCHIZOFLG','OTHERDISFLG','ODDFLG','PDDFLG','multi-disorder']
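For concreteness, here is a minimal sketch of the two label encodings described above: the 15-category lumping and the full 2^13 combination code. It assumes each diagnosis flag is stored as 0/1 and represents a row as a plain dict; the helper names `lump_diagnosis` and `combination_code` are hypothetical and not part of the dataset or any library.

```python
# The 13 individual diagnosis flags (taken from disorder_cols, minus the two lumped labels).
FLAG_COLS = ['DELIRDEMFLG', 'CONDUCTFLG', 'ADHDFLG', 'DEPRESSFLG', 'BIPOLARFLG',
             'PERSONFLG', 'ALCSUBFLG', 'TRAUSTREFLG', 'ANXIETYFLG', 'SCHIZOFLG',
             'OTHERDISFLG', 'ODDFLG', 'PDDFLG']

def lump_diagnosis(row):
    """Map 0/1 flags to 'no_disorder', a single flag name, or 'multi-disorder'."""
    active = [col for col in FLAG_COLS if row.get(col, 0) == 1]
    if not active:
        return 'no_disorder'
    if len(active) == 1:
        return active[0]
    return 'multi-disorder'

def combination_code(row):
    """Encode the full flag pattern as an integer in [0, 2**13), one bit per flag."""
    return sum(1 << i for i, col in enumerate(FLAG_COLS) if row.get(col, 0) == 1)

# Toy rows, assumed 0/1 coding; missing keys are treated as 0.
rows = [
    {},                                   # no flags set
    {'ADHDFLG': 1},                       # exactly one diagnosis
    {'DEPRESSFLG': 1, 'ANXIETYFLG': 1},   # two diagnoses
]
labels = [lump_diagnosis(r) for r in rows]
codes = [combination_code(r) for r in rows]
print(labels)
print(codes)
```

The lumped encoding discards which diagnoses co-occur, while the combination code keeps everything but spreads the rows over up to 8192 classes, many of which may be rare or empty.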

EDIT: k-Means is very revealing on the t-SNE plot. num_clusters=13.

[Image: t-SNE plot with k-means where num_clusters=13]

  • For medical diagnosis problems, the misclassification costs are unlikely to be equal. To get a useful classifier system, you should first consider what those misclassification costs might reasonably be and use minimum risk classification. – Dikran Marsupial Mar 04 '24 at 17:09
  • @StephanKolassa With respect to outputting probabilities, for now I’m just using two standard classifiers. I’m not sure I understand how to modify them. If I have a 60/40 heads/tails weighted coin, my classifier predicts heads only, after 100 flips it’s right 60 times, and is 60% accurate. Obviously I’m missing something. Fortunately these classes have more information. It’s really 13 binary values, one for each diagnosis. I suppose I could modify the model to output a list of 13 probabilities. That was my original plan with a neural net, but I’m not there yet. – Jackson Walters Mar 04 '24 at 18:51
  • @DikranMarsupial What do you mean by misclassification cost? Like in real life, a bad diagnosis? For example, the difference between someone being misclassified with oppositional defiant disorder vs. schizophrenia? – Jackson Walters Mar 04 '24 at 18:52
  • Probabilistic classifications are what I personally pretty much always recommend. Alternatively, use a loss that explicitly incorporates the differential costs of mis"classifications", as per @DikranMarsupial. Incorporating this additional information is much easier if you have probabilistic classifications in the first place, because then you can tweak the decision threshold(s). – Stephan Kolassa Mar 04 '24 at 18:56
  • @JacksonWalters yes, that is correct. Not all misclassifications have the same costs, e.g. conditions with similar or related treatments would have lower misclassification costs than two conditions with very different treatments, where giving the wrong treatment would be harmful. Error rate assumes all misclassifications are equally bad, but that is rarely true for medical applications. – Dikran Marsupial Mar 04 '24 at 18:56
  • Fully agree with @StephanKolassa, especially if you don't know what the misclassification costs are at training time or if they are variable in operation. Discrete (non-probabilistic) classifiers can (in a minority of cases) give better classification, but probabilistic classifiers are more flexible and more difficult to get wrong. – Dikran Marsupial Mar 04 '24 at 18:58
  • Okay, well thanks for the feedback. I’m going to move towards a probabilistic model. How on Earth would I find out the misclassification cost? Would that be a symmetric 13x13 matrix of non-negative reals? – Jackson Walters Mar 04 '24 at 19:05
  • @JacksonWalters, yes, it will be a matrix as you suggest. The best thing to do would be to talk to a suitably qualified medical professional. If it is for a training exercise, then you could generate some example matrices based on your intuition/reasoning and see how assumptions affect the loss matrix and then the minimum risk classifications and the expected loss. It can be difficult to come up with real misclassification costs, but it is very important that they are discussed with the medics. It is basically making sure you are asking the right question of the statistical model. – Dikran Marsupial Mar 04 '24 at 19:09
  • Sounds good. I will try to include them, at least the code required to incorporate that matrix of information. There should be a default case like “all 1’s” consistent with the base case, then more info would just generalize it. I have shared the t-SNE plot with one MD but in general I’ve never professionally collaborated with any. That would be cool. – Jackson Walters Mar 04 '24 at 19:15
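The minimum risk idea discussed in these comments can be sketched in a few lines, assuming the classifier already outputs class probabilities. Everything here is illustrative: the function names, the three-class probabilities, and the cost values are hypothetical, and real costs would have to come from clinicians as suggested above.

```python
def min_risk_class(probs, cost):
    """Pick the class with the lowest expected cost.

    probs[i]   : predicted probability that the true class is i
    cost[i][j] : cost of predicting class j when the true class is i
    Returns (chosen class index, list of expected costs per candidate prediction).
    """
    n = len(probs)
    expected = [sum(probs[i] * cost[i][j] for i in range(n)) for j in range(n)]
    best = min(range(n), key=expected.__getitem__)
    return best, expected

def zero_one_cost(n):
    """Default case: 0 on the diagonal, 1 for every misclassification."""
    return [[0 if i == j else 1 for j in range(n)] for i in range(n)]

probs = [0.6, 0.3, 0.1]  # hypothetical predicted probabilities for one row

# With 0/1 costs the minimum risk choice is just the most probable class.
default_choice, default_expected = min_risk_class(probs, zero_one_cost(3))

# Hypothetical asymmetric costs: missing class 2 is ten times worse than other errors.
cost = [[0, 1, 1],
        [1, 0, 1],
        [10, 10, 0]]
risky_choice, expected = min_risk_class(probs, cost)

print(default_choice, default_expected)
print(risky_choice, expected)
```

With the 0/1 matrix the decision matches plain argmax, which is the "default case" mentioned above. With the asymmetric matrix the decision moves to class 2 even though it has only 10% probability. Nothing in the calculation requires the cost matrix to be symmetric.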
