How to encode multiple Boolean targets for logistic regression

Question

I'm working on the SAMHSA Mental Health Client-Level Dataset. I'm trying to train classifiers to predict the disorder given the rest of the columns. There are 13 binary disorder columns (bipolar, schizophrenia, ADHD, etc.) based on diagnoses.

Code: https://github.com/jacksonwalters/ml-examples/tree/main/mental_health_client-level_data

I've trained a RandomForestClassifier and a multi-class LogisticRegression. Both reach about 36% accuracy, 30% precision, and 36% recall. For these I binned the disorders into 15 classes (0 disorders, each of the 13 single disorders, > 1 disorder). I also tried binary encoding the 2^13 = 8192 combinations of disorders, which gave similar accuracy but 17% precision. A random guess in the former (15-class) case should be ~7%.

If I predict the k-means labels, I get ~92% accuracy, precision, recall.

For the disorder labeling, should I use LogisticRegression on one disorder at a time, dropping the other 12 disorder columns and performing the classification on the remaining columns? I'd do this for all 13 disorder columns, then just divide by the sum to get a 13-dimensional vector of probabilities as output.

The output/labels are really Boolean vectors. If someone is diagnosed with ADHD by two different doctors, they do not have 2*ADHD, they just have ADHD, so ADHD + ADHD = ADHD, i.e. idempotent.
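A minimal sketch of that encoding, assuming made-up disorder names and patient records (the real dataset has 13 such columns). Collecting diagnoses in a set makes repeated diagnoses collapse, so each patient maps to one 0/1 vector:

```python
# Hypothetical subset of disorder names; the real data has 13 columns.
DISORDERS = ["ADHD", "bipolar", "schizophrenia"]


def encode(diagnoses):
    """Map a list of diagnoses (duplicates allowed) to a 0/1 indicator vector."""
    present = set(diagnoses)  # a set ignores repeats: ADHD + ADHD = ADHD
    unknown = present - set(DISORDERS)
    if unknown:
        raise ValueError(f"unknown disorders: {sorted(unknown)}")
    return [int(d in present) for d in DISORDERS]


# Made-up patients: a repeated diagnosis, two disorders, and no disorder.
patients = [
    ["ADHD", "ADHD"],
    ["schizophrenia", "bipolar"],
    [],
]
Y = [encode(p) for p in patients]
```

Here `Y` is a 3 x 3 indicator matrix with one row per patient, which is the shape multilabel tools typically expect for the target.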

Welcome to Cross Validated! What is the outcome, a list of the disorders, something like “patient 1 has ADHD, patient 2 has schizophrenia and is bipolar, patient 3 has none of the disorders, etc”? — Dave, Mar 06 '24 at 14:59
@Dave Thank you! Excited to be here. Exactly, there are 13 disorders (listed here, bottom: https://stats.stackexchange.com/questions/641651/finding-parameters-which-reveal-clustering-in-t-sne) which are (schizophrenia, bipolar, ADHD, ODD (oppositional defiant), SUB (substance abuse), etc.), each one a separate 0/1, Boolean value. — Jackson Walters, Mar 06 '24 at 15:08
And a patient can have one, multiple, or zero of the disorders, yes? — Dave, Mar 06 '24 at 15:33
@Dave Yes. That was my first set of categories, resulting in 1+13+1=15. — Jackson Walters, Mar 06 '24 at 15:37
Then it seems that you have a [tag:multilabel] problem. Perhaps do some reading on that subject and see if it makes sense in the context of your problem. — Dave, Mar 06 '24 at 15:47
@Dave It's definitely multi-label. I'll do some more reading but my understanding was LogisticRegression handles multi-labels naturally, which I'm doing. I'm just wondering if I should really have 8192 classes, or just try to output vectors instead. — Jackson Walters, Mar 06 '24 at 16:01
I think it would be very reasonable to model the (possible dependent) probabilities of thirteen binary variables. Multivariate probit regression might be a reasonable starting point. — Dave, Mar 06 '24 at 16:19
@Dave Thank you, I'll look into multivariate probit. I think "drop all but one" is only a few lines of code, so will compare. — Jackson Walters, Mar 06 '24 at 16:21
@Dave I have 13 columns I want to predict for. I can drop 12 of those, and just use the remaining cols to predict a single target, say 'schizophrenia'. Then I can move to the next one, say 'bipolar'. I drop 12 columns again, this time leaving 'bipolar' but removing 'schizophrenia' and the others, and make a prediction for bipolar. I do this for all 13 columns, yielding 13 numbers. Dividing by the sum to normalize gives a vector of probabilities. — Jackson Walters, Mar 06 '24 at 17:00
1) You don't have to divide by the sum. You have thirteen probability values; those are the predicted probabilities of the patient having each disorder. 2) You don't have to do this one-at-a-time. The multivariate probit model will jointly model all thirteen probability values at once. This has an advantage if disorders tend to go together. That's why multi-label problems are studied instead of just modeling each label one-at-a-time. — Dave, Mar 06 '24 at 17:06
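To illustrate the first point, here is a rough sketch of the one-model-per-disorder approach on synthetic data. The features, weights, and sample size are invented for illustration and are not the SAMHSA data. Each column of `probs` is already a probability for its own disorder, so the rows do not need to sum to 1:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the real data: 5 features, 13 binary targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
W = rng.normal(size=(5, 13))
Y = (X @ W + rng.normal(size=(400, 13)) > 0).astype(int)

# Fit one independent binary logistic regression per target column and
# keep the predicted probability of the positive class (column 1).
probs = np.column_stack([
    LogisticRegression(max_iter=1000).fit(X, Y[:, j]).predict_proba(X)[:, 1]
    for j in range(Y.shape[1])
])

# Row sums are unconstrained because the disorders are not mutually exclusive.
row_sums = probs.sum(axis=1)
```

Dividing by `row_sums` would force the thirteen values to compete with each other, which is the multi-class assumption rather than the multilabel one.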

score 2 · Accepted Answer · answered Mar 06 '24 at 17:32

Binary problems use features to make predictions about a binary outcome: yes/no, alive/dead, dog/cat, etc.

A multi-class problem uses features to make predictions about an outcome that can be one of multiple categories: dog/cat/alligator, 0/1/2/3/4/5/6/7/8/9, shirt/boots/dress, etc.

A multilabel problem uses features to make predictions about an outcome that can be one, several, or none of multiple categories.

This is an archetypal multilabel problem.

There are thirteen categories, so the problem cannot be binary.
The categories are not mutually exclusive, so the problem cannot be multi-class.

Therefore, a reasonable task is to predict the probability of having each disorder vs not having the disorder, which is exactly a multilabel problem. One of the more basic models that still treats this as more complex than just thirteen binary problems (e.g., bipolar vs not, ADHD vs not, etc.) is a multivariate probit model, which allows for correlations between outcomes. For instance, if ADHD and schizophrenia tend to go together, the model can assign a high probability to one when the other has a high predicted probability.
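To make the role of the correlation concrete, here is a small simulation of the data-generating process a multivariate probit assumes, reduced to two outcomes. It only simulates the model and does not fit one. The coefficients `beta` and the correlation `rho` are hypothetical values chosen for illustration:

```python
import numpy as np


def simulate(rho, n=20_000, seed=0):
    """Draw two binary outcomes from a bivariate probit with error correlation rho."""
    rng = np.random.default_rng(seed)
    Sigma = np.array([[1.0, rho], [rho, 1.0]])  # latent error covariance
    x = rng.normal(size=n)                      # a single feature
    beta = np.array([0.5, 0.3])                 # hypothetical coefficients
    eps = rng.multivariate_normal(np.zeros(2), Sigma, size=n)
    latent = np.outer(x, beta) + eps            # latent score per outcome
    return (latent > 0).astype(int)             # outcome is 1 when latent > 0


def cooccurrence(Y):
    """Share of rows in which both outcomes equal 1."""
    return float((Y[:, 0] & Y[:, 1]).mean())


independent = cooccurrence(simulate(rho=0.0))
correlated = cooccurrence(simulate(rho=0.6))
```

With the same feature and coefficients, the correlated errors make the two outcomes occur together more often. Thirteen separate binary models have no parameter that can represent this.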

Sophisticated models of such an outcome could include neural networks. These multilabel problems are far from obscure, so there should be plenty of tutorials available about the coding.

I will close with links to some posts on here about multilabel problems.

Ideal loss for a multi-label problem with soft targets

How robust is multinomial logistic regression to having a multi-label problem shoehorned into it?

Predictions when multiple outcomes

What is the statistical model for a multi-label problem?

Duplicates with different labels in multi-label classification

What is the difference between a multiclass and a multilabel problem?

What are the measure for accuracy of multilabel data?

Multilabel classification metrics on scikit

How to apply neural networks on multi-label classification problems?

Dave

Excellent answer, thank you for the disambiguation. I've implemented both binary predictions for all 13 classes, and I'm using MultiOutputClassifier to do multi-label predictions. – Jackson Walters Mar 06 '24 at 18:27
@JacksonWalters Fitting thirteen binary models (e.g., logistic regression) is a way to solve a multi-label problem. Each of those thirteen models gives you the (predicted) probability of a disorder, so you wind up with the probability of each disorder. However, such an approach makes an assumption that the disorders are independent, which can be relaxed when you explicitly model the disorders jointly. – Dave Mar 06 '24 at 20:39
I haven't looked at the error metric for the 13 binary models yet, but for the MultiOutputClassifier(LogisticRegression(...)) I'm using hamming_loss and getting .1. I think this is consistent, in that actually predicting the diagnostic multi-label from demographic info plus some life factors should be pretty hard without any symptom info. Does using MultiOutputClassifier count as modeling them jointly? That's what I want to do, because they're not independent. – Jackson Walters Mar 06 '24 at 21:51
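For reference, a sketch of what the Hamming loss measures on a 0/1 indicator matrix, alongside a trivial baseline. The labels are synthetic and the 10% prevalence per disorder is an assumption for illustration, not a figure from the dataset:

```python
import numpy as np


def hamming_loss_manual(Y_true, Y_pred):
    """Fraction of individual (patient, disorder) entries predicted wrongly."""
    Y_true = np.asarray(Y_true)
    Y_pred = np.asarray(Y_pred)
    return float((Y_true != Y_pred).mean())


# Hypothetical sparse labels: each of 13 disorders present with probability 0.1.
rng = np.random.default_rng(0)
Y_true = (rng.random((1000, 13)) < 0.1).astype(int)

# Trivial baseline that predicts "no disorder" for every patient and label.
baseline = hamming_loss_manual(Y_true, np.zeros_like(Y_true))
```

When positives are rare, predicting all zeros already yields a Hamming loss near the label prevalence, so it may be worth comparing a model's loss against such a baseline before interpreting it.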
@JacksonWalters The documentation suggests to me that your function assumes independence. Software suggestions to implement a multivariate probit or other multilabel model are considered off-topic here but might be on-topic elsewhere. – Dave Mar 06 '24 at 21:57
Ah, okay. It appears multi-output != multi-label, I got confused. I did run the hamming_loss on the 13-binary-models, and got .12. I will seek elsewhere for a multilabel implementation. OneVsRestClassifier looks good. – Jackson Walters Mar 06 '24 at 22:04
OneVsRestClassifier sounds like a multi-class problem, not multi-label. @JacksonWalters – Dave Mar 07 '24 at 00:38
I'm not sure why a suggestion for a multilabel algorithm would be off-topic. Regarding OvR, from the sklearn documentation: "OneVsRestClassifier can also be used for multilabel classification. To use this feature, provide an indicator matrix for the target y when calling .fit. In other words, the target labels should be formatted as a 2D binary (0/1) matrix". In any case, I have installed scikit-multilearn and implemented MLkNN (after resolving a minor bug in their code) and I'm getting a Hamming loss of .11, consistent with .1 and .12 from the other methods. – Jackson Walters Mar 07 '24 at 11:39
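A short sketch of the usage the quoted documentation describes, on synthetic data (the features and weights are invented for illustration). Passing a 2D 0/1 matrix as `y` puts OneVsRestClassifier in multilabel mode:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import hamming_loss
from sklearn.multiclass import OneVsRestClassifier

# Synthetic stand-in: 5 features and a 400 x 13 binary indicator target.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
W = rng.normal(size=(5, 13))
Y = (X @ W + rng.normal(size=(400, 13)) > 0).astype(int)

# A 2D indicator matrix as y triggers the multilabel behaviour.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)

Y_pred = clf.predict(X)        # 0/1 matrix with one column per label
probs = clf.predict_proba(X)   # per-label probabilities, rows need not sum to 1
loss = hamming_loss(Y, Y_pred) # in-sample loss, for illustration only
```

As far as I understand, this still fits one separate binary classifier per label, so it handles the multilabel format but does not model dependence between the disorders.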
