My data consists of many thousands of responses to ~2000 multiple-choice questions (2 to 4 options each). Most of the questions have not been answered by most people, i.e. a typical question has a response rate below 5%.
I'm interested in finding out which questions have high predictive power for which other questions, e.g. people who answer option A to Q243 are five times as likely as average to answer option C to Q64.
I have constructed a 4x4 answer matrix for every pair of questions that records, among those who answered both questions, how many people gave each possible combination of answers. It looks something like this:
               Qb
        |   0   1   2   3
      --|----------------
       0|   3  12   5   2
    Qa 1|  11   2  25  10
       2|  16   5  10   7
       3|  10   2  10  17
So the probability that a random person (who answered both questions) selected Qa option 1, given that they selected Qb option 2, is: 25 / (5 + 25 + 10 + 10) = 0.5
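To make the calculation concrete, here is a minimal sketch in plain Python using the counts from the table above. The names `counts` and `conditional_prob` are just illustrative, and I'm assuming rows index Qa options and columns index Qb options.

```python
# Counts from the table above: rows are Qa options 0-3, columns are Qb options 0-3.
counts = [
    [3, 12, 5, 2],
    [11, 2, 25, 10],
    [16, 5, 10, 7],
    [10, 2, 10, 17],
]


def conditional_prob(counts, a, b):
    """P(Qa = a | Qb = b), among people who answered both questions."""
    # Everyone who chose option b for Qb is in column b of the matrix.
    column_total = sum(row[b] for row in counts)
    return counts[a][b] / column_total


p = conditional_prob(counts, a=1, b=2)
print(p)  # 0.5
```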
To explore the predictive power of Qb option 2 for Qa option 1, I would want to compare this number, 0.5, to the probability that a random person would choose Qa option 1 given no further information, i.e. if 50% of the population in general chose option 1 for Qa anyway, then knowing that the random person also answered option 2 to Qb would give no enhanced predictive power.
My question is, what would be the best population to compare to?
- I could compare to the proportion of people who answered Qa option 1 among all those who answered Qa (this would include people not recorded in the above comparison matrix: those who answered Qa but not Qb)
- Or I could compare just to those who answered both Qa and Qb (i.e. the sum of the above matrix)
If the questions people answered were a random subset of all questions then there should be no difference, but there is some selection bias as people could choose to skip questions they didn't want to answer.
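To illustrate the two candidate baselines, here is a rough sketch in plain Python. The function name `lift` and the parameter `qa_marginal` are my own illustrative names, and the counts in `qa_all_answers` are made up purely for demonstration (they are not from my data); only `counts` comes from the matrix above.

```python
# Counts from the matrix above: rows are Qa options, columns are Qb options.
counts = [
    [3, 12, 5, 2],
    [11, 2, 25, 10],
    [16, 5, 10, 7],
    [10, 2, 10, 17],
]


def lift(counts, a, b, qa_marginal=None):
    """Ratio of P(Qa = a | Qb = b) to a baseline P(Qa = a).

    If qa_marginal is None, the baseline uses only people who answered
    both questions (row total / grand total of the matrix).
    Otherwise qa_marginal is a list of counts per Qa option over everyone
    who answered Qa, whether or not they answered Qb.
    """
    p_cond = counts[a][b] / sum(row[b] for row in counts)
    if qa_marginal is None:
        baseline = sum(counts[a]) / sum(sum(row) for row in counts)
    else:
        baseline = qa_marginal[a] / sum(qa_marginal)
    return p_cond / baseline


# Baseline from the matrix itself: 48 of 147 people chose Qa option 1.
print(round(lift(counts, a=1, b=2), 3))  # 1.531

# HYPOTHETICAL counts of all Qa answers (invented for illustration only).
qa_all_answers = [200, 300, 250, 250]
print(lift(counts, a=1, b=2, qa_marginal=qa_all_answers))
```

With the matrix-only baseline, the comparison stays within one population (people who answered both), so any difference between the two results would reflect how the people who skipped Qb differ from those who did not.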