2

Background and Setting

I have data in this format: for each subject, a list of exposures to some substances, some demographics, and then a multi-valued response (whether the subject developed a disease and, if so, what type of disease). For example:
Subject1 was a 15-year-old male living in city C; he was exposed to substances A, B and C, and the outcome: subject had skin cancer.
Subject2 .... outcome: subject had psoriasis...
Subject3 .... outcome: subject was healthy...etc


Question

I would like to be able to make predictions from these data, to estimate what disease a new subject is likely to develop given his demographics and exposure history. I tried to play around with logistic regression, but the response is not binary; it could be any of a multitude of outcomes/diseases (including cases of absence of disease). How can I proceed to obtain my predictions, if that is possible at all? Thanks

Dave
  • 62,186

3 Answers

2

Check out Multinomial logistic regression.
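
Not part of the original answer, but here is a minimal sketch of what that looks like in Python with scikit-learn; the data frame, column names, and outcome labels are invented to mirror the question's example.

```python
# Minimal multinomial logistic regression sketch (hypothetical data).
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Demographics, exposure indicators, and one outcome column with
# several possible levels (including "healthy").
df = pd.DataFrame({
    "age":       [15, 42, 33, 57],
    "male":      [1, 0, 1, 0],
    "exposed_A": [1, 0, 1, 0],
    "exposed_B": [1, 1, 0, 0],
    "exposed_C": [1, 0, 0, 1],
    "outcome":   ["skin cancer", "psoriasis", "healthy", "healthy"],
})

X = df.drop(columns="outcome")
y = df["outcome"]

# With more than two classes and the default lbfgs solver,
# scikit-learn fits a multinomial (softmax) model.
model = LogisticRegression(max_iter=1000).fit(X, y)

# Predicted probabilities for a new subject: one per outcome level,
# summing to 1.
new_subject = pd.DataFrame({"age": [28], "male": [1], "exposed_A": [1],
                            "exposed_B": [0], "exposed_C": [0]})
print(dict(zip(model.classes_, model.predict_proba(new_subject)[0])))
```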

Stat
  • 7,486
  • Adding a bit of explanation - you don't really have multiple outcomes, you have one outcome with multiple possible levels: No disease, psoriasis, cancer ... – Peter Flom Nov 23 '12 at 12:47
  • Maybe the OP can comment, and I should say I am not an expert on this. But I think if the outcomes are limited to the type of disease (psoriasis, cancer, ...), then we have one outcome with multiple levels. However, here we also have the case of "absence of disease". So maybe "multiple outcomes" is the more appropriate term to use. – Stat Nov 23 '12 at 16:44
  • OK, after examining the data, I don't think I will need to include the "absence of disease" cases. So in fact the outcomes will be limited to the type of disease. Thanks – francogrex Nov 24 '12 at 09:38
2

This problem is known as "classification", so searching for "classification" should prove helpful. CARTs (Classification and Regression Trees) are commonly used for this.
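
As a rough sketch (not from the original answer; scikit-learn in Python is assumed, and the toy data are invented), a classification tree can be fit like this:

```python
# Minimal CART-style classification tree sketch (hypothetical data).
from sklearn.tree import DecisionTreeClassifier, export_text

# Columns: age, male, exposed to A, exposed to B, exposed to C.
X = [[15, 1, 1, 1, 1],
     [42, 0, 0, 1, 0],
     [33, 1, 1, 0, 0],
     [57, 0, 0, 0, 1]]
y = ["skin cancer", "psoriasis", "healthy", "healthy"]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["age", "male", "A", "B", "C"]))

# Predicted disease for a new subject.
print(tree.predict([[28, 1, 1, 0, 0]]))
```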

I recommend looking at The Elements of Statistical Learning by Hastie, Tibshirani and Friedman, which is available online.

User1865345
  • 8,202
Stephan Kolassa
  • 123,354
0

This sounds like a multi-label problem, which is somewhat different from a multi-class problem.

A multi-class problem uses the various features to estimate the probabilities of multiple events, exactly one of which will happen: the subject who is $28$ years old, lives in New York, and was exposed to asbestos has a probability of being healthy of $0.8$, a probability of skin cancer of $0.1$, and a probability of psoriasis of $0.1$; if you have more than just those two diseases, include them all to get probabilities of each disease. Then, of all the options (healthy or any one of the diseases), exactly one will happen. The likely outcome here is that the subject will be healthy, but the subject might also be upset to know of a decent probability of developing some nasty diseases, even if they’re not among the most likely outcomes. (Would you do something if it had a $30\%$ chance of causing you an agonizing death? Why not? You’re probably going to be fine.)

Multinomial logistic regression is the starting point for this kind of prediction. Such a model gives the probability of each individual disease and the probability of being healthy, the sum of which is one.
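
Explicitly (this parameterization is not in the original answer, but it is the standard softmax form of the model): for outcome levels $k = 1, \dots, K$ (one of them being "healthy") and covariate vector $x$,

$$\Pr(Y = k \mid x) = \frac{\exp(x^\top \beta_k)}{\sum_{j=1}^{K} \exp(x^\top \beta_j)},$$

with one level's coefficient vector conventionally fixed at zero for identifiability; the $K$ probabilities then sum to one by construction.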

However, multiple diseases can happen. Someone can have skin cancer and epilepsy. Someone can have depression and diabetes. Someone can have Covid and HIV and liver cancer. Modeling (the probabilities of) categorical outcomes, multiple of which can occur, is a multi-label problem. The idea is to model the individual probabilities of binary events (cancer vs no cancer, depression vs no depression, Covid vs no Covid, etc), which might be independent but do not have to be.
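
Here is a minimal sketch of that idea (sometimes called binary relevance) in Python with scikit-learn; the subjects, exposures, and disease labels are invented for illustration, and fitting one logistic regression per label treats the labels as independent given the features.

```python
# Binary-relevance multi-label sketch (hypothetical data).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Features: age, male, exposures A/B/C.
X = np.array([[15, 1, 1, 1, 1],
              [42, 0, 0, 1, 0],
              [33, 1, 1, 0, 0],
              [57, 0, 0, 0, 1]])

# One binary column per disease; a row of all zeros means "healthy",
# and a subject can have several diseases at once.
diseases = ["skin cancer", "psoriasis", "depression"]
Y = np.array([[1, 0, 0],
              [0, 1, 1],
              [0, 0, 0],
              [0, 0, 1]])

# One logistic regression per disease, fit independently.
model = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)

# One probability of "has this disease" per label; they need not sum to 1.
probs = [p[0, 1] for p in model.predict_proba([[28, 1, 1, 0, 0]])]
print(dict(zip(diseases, probs)))
```

Classifier chains or the joint (Ising-type) models mentioned below are options when the labels are clearly not independent given the features.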

My answer here gets more into the difference between multi-label and multi-class problems, and my question here has multiple nice answers that discuss the underlying statistical model of a multi-label classification, with the comment about an Ising distribution being particularly helpful (even if my main question about the prior probability remains unanswered).

Dave
  • 62,186