I was wondering whether there is a specific procedure in either R or SAS which can handle binary correlated data (multivariate logistic regression). More specifically I have a sample of 400 individuals who have selected their food likes among a variety of available options (binary). Also the participants indicated their age, whether they are active or not and their gender. The food options were: pizza, salad, cheese cake and juices. I wanted to know whether there is a procedure in either R or SAS which can explore the correlations between the food options among different demographics for prediction modeling purposes (e.g. younger males who are active and like salad tend to like juices as well).
3 Answers
Since the OP has not been here since three years before I registered, I find it unlikely that we will get clarification on what is meant by a multivariate logistic regression or binary correlated data. I see two possibilities:
1. A regression to find the probability that someone selects an item from mutually exclusive choices.
2. A regression to find the probability of someone selecting each item when all or perhaps none can be chosen.
The typical model for problem 1 would be multinomial logistic regression, which models the probabilities of mutually exclusive events. However, the OP makes it sound like someone can choose both juice and salad, so “mutually exclusive” does not seem to fit this problem. In that case, this is a multi-label problem.
R seems to have a package called “mlr” that does this sort of regression.
The theory behind such a model is that several binomial variables are modeled at once, while accounting for possible correlations between them. Consequently, the regression could predict high probabilities for all four categories (e.g., $99\%$ for all four is possible…maybe someone is really hungry), and it could predict low probabilities for all four categories (e.g., $1\%$ for all four if the person is not hungry). The point is that the four predicted probabilities do not have to add up to one, unlike in a multinomial problem with mutually exclusive categories (which might omit one of the categories and leave the probability of that one as whatever probability is left over after the probabilities are calculated for the other categories).
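To make the contrast concrete, here is a minimal Python sketch that is not tied to any package mentioned in this thread. The linear scores are made-up numbers chosen for illustration. It only shows that per-label logistic probabilities are unconstrained in their sum, while a softmax over the same scores (the multinomial case) always sums to one.

```python
import math

def sigmoid(z):
    """Logistic function: maps a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    """Multinomial (mutually exclusive) probabilities; they sum to one."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical linear scores for one person: pizza, salad, cheese cake, juices
scores = [4.6, 4.6, 4.6, 4.6]

multi_label = [sigmoid(s) for s in scores]  # one binary probability per item
multinomial = softmax(scores)               # one choice among the four items

print([round(p, 2) for p in multi_label])
print([round(p, 2) for p in multinomial])
print(round(sum(multi_label), 2), round(sum(multinomial), 2))
```

This sketch treats the four labels independently; it does not model the correlations between them, which is the part a multivariate model adds.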
I have a question posted on here about multi-label problems.
What is the statistical model for a multi-label problem?
Multivariate probit regression would be the typical basic model for multiple labels, as the Gaussian-based link function makes it easier to handle correlations between the categories than a logistic link function does. If you want something the linked “mlr” package does not provide, perhaps this would be a good place to start your search.
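As a rough illustration of why the Gaussian formulation handles correlation naturally, the sketch below simulates two labels (say, salad and juices) from a bivariate probit: each label is "liked" when a latent standard normal variable is positive, and the two latent variables share a correlation `rho`. The value of `rho`, the sample size, and the omission of covariates (all linear predictors are zero) are assumptions made purely for illustration; this simulates from the model rather than fitting it.

```python
import math
import random

def simulate_two_labels(rho, n, seed=0):
    """Draw n pairs of binary labels from a bivariate probit with latent correlation rho."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        # Build a second standard normal with correlation rho to z1
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        # A label is "liked" when its latent variable is positive
        pairs.append((z1 > 0.0, z2 > 0.0))
    return pairs

pairs = simulate_two_labels(rho=0.6, n=20000)
p_salad = sum(a for a, _ in pairs) / len(pairs)
p_juice = sum(b for _, b in pairs) / len(pairs)
p_both = sum(a and b for a, b in pairs) / len(pairs)

# Under independence, p_both would be close to p_salad * p_juice
print(p_salad, p_juice, p_both, p_salad * p_juice)
```

With a positive latent correlation, the share of people liking both items exceeds the product of the two marginal shares, which is the kind of dependence the original question is asking about.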
More advanced methods for doing multi-label problems exist, too, including deep learning approaches that I suspect are supported by the TensorFlow implementation in R.
Both R and SAS can handle your situation:
For R, you can check http://www.ats.ucla.edu/stat/r/dae/melogit.htm. It is called Mixed Effects Logistic Regression. I think it is another name for "Multivariate Logistic Regression"; note that it is not "Multiple Logistic Regression".
To use SAS, you can either use PROC GENMOD (GEE) http://www.math.unm.edu/~bedrick/glm/examples1.pdf or PROC GLIMMIX to conduct a mixed effects logistic regression http://www.lexjansen.com/wuss/2006/analytics/ANL-Dai.pdf
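Both GEE and mixed effects approaches typically expect the data in "long" form, with one row per participant and food item, so that the four responses from one person share a subject identifier. A minimal Python sketch of that reshaping follows; the records and column names are made up for illustration and are not taken from the question's data.

```python
FOODS = ["pizza", "salad", "cheese_cake", "juices"]

# Hypothetical wide-format records: one row per participant
wide = [
    {"id": 1, "age": 23, "gender": "M", "active": 1,
     "pizza": 1, "salad": 1, "cheese_cake": 0, "juices": 1},
    {"id": 2, "age": 51, "gender": "F", "active": 0,
     "pizza": 0, "salad": 1, "cheese_cake": 1, "juices": 0},
]

def to_long(rows, items):
    """Stack the item columns: one output row per (participant, item)."""
    long_rows = []
    for row in rows:
        # Keep the participant-level columns (id, age, gender, active)
        covariates = {k: v for k, v in row.items() if k not in items}
        for item in items:
            long_rows.append({**covariates, "item": item, "liked": row[item]})
    return long_rows

long_data = to_long(wide, FOODS)
for r in long_data[:4]:
    print(r)
```

In the long layout, `id` would serve as the subject or grouping variable and `liked` as the binary response, with `item` entering the model as a factor.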
Correction: if the participants were allowed to choose more than one option, multinomial logit will not work (as Deep North noted).
Because you have four different options (pizza, salad, cheese cake and juices) for food choices, this fits into the framework of multinomial logit, provided each participant picks exactly one option.
You can also consider a nested multinomial logit model. For example, cheese cake and pizza can fall into one nest.
http://www.mea.mpisoc.mpg.de/uploads/user_mea_discussionpapers/dp16.pdf
Multinomial logit might not work, since the choices of pizza, salad, cheese cake and juices may not be mutually exclusive. So you cannot assume a multinomial distribution for your choice. – Deep North Aug 20 '15 at 02:10
@Deep North: If the participants were allowed to choose more than one option, then you are absolutely correct. – subra Aug 20 '15 at 03:09
glmer: in that paper, they are discussing Hierarchical/Multilevel Generalized Linear Models. The confusion is quite understandable, as the terminology in the literature is extremely inconsistent and confusing. – Matthew Drury Aug 20 '15 at 02:12