
I'm working on the SAMHSA Mental Health Client-Level Dataset. I'm trying to train classifiers to predict the disorder given the rest of the columns. There are 13 binary disorder columns (bipolar, schizophrenia, ADHD, etc.) based on diagnoses.

Code: https://github.com/jacksonwalters/ml-examples/tree/main/mental_health_client-level_data

I've trained a RandomForestClassifier and a multi-class LogisticRegression. Both reach about 36% accuracy, 30% precision, and 36% recall. For these I binned the disorders into 15 classes [0 disorders, exactly 1 disorder (one class per disorder), > 1 disorder]. I also tried binary encoding the 2^13=8192 combinations of disorders, which gave similar accuracy but 17% precision. A random guess in the former case should be ~7% (1/15).
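A minimal sketch of the two label encodings described above, assuming each patient's diagnoses are available as a list of 13 zeros and ones. The function names are hypothetical, and the 15-class reading (none, exactly one of the 13, more than one) is an assumption based on the 1+13+1 count given in the comments below.

```python
N_DISORDERS = 13  # number of binary disorder columns

def combo_code(labels):
    """Encode a 0/1 vector as one integer in [0, 2**13), first entry = highest bit."""
    code = 0
    for bit in labels:
        code = (code << 1) | int(bit)
    return code

def binned_class(labels):
    """Map a 0/1 vector to one of 15 classes (assumed reading of the binning):
    0 = no disorder, 1..13 = exactly that one disorder, 14 = more than one."""
    total = sum(labels)
    if total == 0:
        return 0
    if total == 1:
        return list(labels).index(1) + 1
    return 14

# Hypothetical patient with only the third disorder flagged
patient = [0] * N_DISORDERS
patient[2] = 1
patient_bin = binned_class(patient)
patient_code = combo_code(patient)
```

With either encoding the classifier sees a single categorical target, so any structure shared between combinations (for example, two combinations that differ in one disorder) is not visible to it.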

If I predict the k-means labels, I get ~92% accuracy, precision, recall.

For the disorder labeling, should I instead fit a separate LogisticRegression for each disorder: drop the other 12 disorder columns and classify that one disorder from the remaining columns? I'd do this for all 13 columns, then divide by the sum to get a 13-dimensional vector of probabilities as output.
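One way the per-disorder idea could be sketched with scikit-learn is shown here. The data is synthetic (random features and labels generated only for illustration, not the SAMHSA columns), and wrapping LogisticRegression in MultiOutputClassifier is one possible choice, not necessarily what the linked repository does. It fits one independent binary model per label.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Synthetic stand-in data: 500 rows, 6 numeric features, 13 binary labels
rng = np.random.default_rng(0)
n_samples, n_features, n_labels = 500, 6, 13
X = rng.normal(size=(n_samples, n_features))
W = rng.normal(size=(n_features, n_labels))
Y = (X @ W + rng.normal(size=(n_samples, n_labels)) > 0).astype(int)

# One logistic regression per label, each trained on the same features
clf = MultiOutputClassifier(LogisticRegression(max_iter=1000))
clf.fit(X, Y)

# predict_proba returns a list with one (n_samples, 2) array per label;
# column 1 of each array is the predicted probability that the label is 1
proba = np.column_stack([p[:, 1] for p in clf.predict_proba(X)])
```

Each column of `proba` is a separate per-disorder probability, so a row is not a distribution over disorders and its entries need not sum to 1.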

The output/labels are really Boolean vectors. If someone is diagnosed with ADHD by two different doctors, they do not have 2*ADHD, they just have ADHD, so ADHD + ADHD = ADHD, i.e. idempotent.
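The idempotence described above can be illustrated by combining diagnosis vectors with an element-wise OR instead of addition. This is only a sketch; `merge_diagnoses` is a hypothetical helper, not something from the dataset or the linked code.

```python
def merge_diagnoses(a, b):
    """Combine two 0/1 diagnosis vectors with element-wise OR."""
    return [x | y for x, y in zip(a, b)]

# Hypothetical 3-disorder example: two doctors, overlapping diagnoses
doctor_1 = [1, 0, 0]
doctor_2 = [1, 1, 0]
merged = merge_diagnoses(doctor_1, doctor_2)
same = merge_diagnoses(doctor_1, doctor_1)  # OR-ing a vector with itself changes nothing
```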

[Figure: confusion matrix for the logistic regression model]

  • Welcome to Cross Validated! What is the outcome, a list of the disorders, something like “patient 1 has ADHD, patient 2 has schizophrenia and is bipolar, patient 3 has none of the disorders, etc”? – Dave Mar 06 '24 at 14:59
  • @Dave Thank you! Excited to be here. Exactly, there are 13 disorders (listed here, bottom: https://stats.stackexchange.com/questions/641651/finding-parameters-which-reveal-clustering-in-t-sne) which are (schizophrenia, bipolar, ADHD, ODD (oppositional defiant), SUB (substance abuse), etc.), each one a separate 0/1, Boolean value. – Jackson Walters Mar 06 '24 at 15:08
  • And a patient can have one, multiple, or zero of the disorders, yes? – Dave Mar 06 '24 at 15:33
  • @Dave Yes. That was my first set of categories, resulting in 1+13+1=15. – Jackson Walters Mar 06 '24 at 15:37
  • Then it seems that you have a multilabel problem. Perhaps do some reading on that subject and see if it makes sense in the context of your problem. – Dave Mar 06 '24 at 15:47
  • @Dave It's definitely multi-label. I'll do some more reading but my understanding was LogisticRegression handles multi-labels naturally, which I'm doing. I'm just wondering if I should really have 8192 classes, or just try to output vectors instead. – Jackson Walters Mar 06 '24 at 16:01
  • I think it would be very reasonable to model the (possibly dependent) probabilities of thirteen binary variables. Multivariate probit regression might be a reasonable starting point. – Dave Mar 06 '24 at 16:19
  • @Dave Thank you, I'll look into multivariate probit. I think "drop all but one" is only a few lines of code, so will compare. – Jackson Walters Mar 06 '24 at 16:21
  • What is “drop all but one”? – Dave Mar 06 '24 at 16:25
  • @Dave I have 13 columns I want to predict for. I can drop 12 of those, and just use the remaining cols to predict a single target, say 'schizophrenia'. Then I can move to the next one, say 'bipolar'. I drop 12 columns again, this time leaving 'bipolar' but removing 'schizophrenia' and the others, and make a prediction for bipolar. I do this for all 13 columns, yielding 13 numbers. Dividing by the sum to normalize gives a vector of probabilities. – Jackson Walters Mar 06 '24 at 17:00
  • 1) You don't have to divide by the sum. You have thirteen probability values; those are the predicted probabilities of the patient having each disorder. 2) You don't have to do this one-at-a-time. The multivariate probit model will jointly model all thirteen probability values at once. This has an advantage if disorders tend to go together. That's why multi-label problems are studied instead of just modeling each label one-at-a-time. – Dave Mar 06 '24 at 17:06
  • 1) True. 2) Right, I'll give multivariate probit a shot. I did a seaborn plot of correlations of disorders earlier. A major thrust of this project is to understand which disorders "go together". In another question on CV I look at t-SNE maps to reveal clustering, but the idea is the same. Frankly, I don't know if MDs know that the set of "disorder combos" is of size 2^13, stratified by the number of disorders, with (n choose k) combos in the layer with num_disorders=k. – Jackson Walters Mar 06 '24 at 17:19
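The counting in the last comment can be checked with a short worked example: the layer with k disorders contains (13 choose k) combinations, and the layers together account for all 2^13 combinations. The variable names are illustrative only.

```python
from math import comb

N_DISORDERS = 13

# Number of distinct disorder combinations with exactly k disorders, for k = 0..13
layer_sizes = [comb(N_DISORDERS, k) for k in range(N_DISORDERS + 1)]

# Summing over all layers recovers the total number of combinations
total_combos = sum(layer_sizes)
```

Here `layer_sizes[0]` and `layer_sizes[1]` are 1 and 13, matching the "no disorder" class and the 13 single-disorder classes in the binned labeling.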