I'm trying to predict the probability that an individual progresses all the way through a series of tasks. Participants start with Task 1; if they succeed, they move on to Task 2; if they succeed there, they move on to Task 3; and if they succeed at Task 3, yay! I want to predict the probability of successfully completing all three tasks.

Right now I'm running a binary logistic regression where the dependent variable is T/F for whether a participant succeeded on Task 3. But that's a very rare outcome, so all of my predicted probabilities are very low. I feel like I'm leaving a lot of information on the table with regard to how people did on Task 1 and Task 2.

I was thinking I could modify the DV so that it's 0 if an individual completed no tasks, 0.33 if they completed only Task 1, 0.66 if they got through Task 2 but not Task 3, and 1 if they completed all three. This makes sense to me because I'm going from a case where an event either happened or didn't to one where I can incorporate information about how close someone got to the critical event. But I can't seem to find anyone else doing it this way, so I assume there's something very wrong with it.
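
To make the two codings concrete, here is a minimal sketch in plain Python. The toy outcomes and the function names `dv_binary` and `dv_fraction` are hypothetical, made up for illustration; this is not my real data or model code.

```python
# Hypothetical toy data: one tuple per participant, giving pass/fail on
# (Task 1, Task 2, Task 3). A task is only attempted after passing the previous one.
participants = [
    (False, False, False),  # completed no tasks
    (True, False, False),   # completed only Task 1
    (True, True, False),    # completed Tasks 1 and 2
    (True, True, True),     # completed all three
]


def dv_binary(outcomes):
    """Current DV: 1 only if Task 3 was completed, otherwise 0."""
    return int(outcomes[2])


def dv_fraction(outcomes):
    """Proposed DV: share of the three tasks completed (0, 1/3, 2/3 or 1)."""
    return sum(outcomes) / len(outcomes)


for outcomes in participants:
    print(outcomes, dv_binary(outcomes), round(dv_fraction(outcomes), 2))
```

Under the binary coding, the first three participants are indistinguishable (all 0), whereas the fractional coding separates them by how far they got.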
Q1. Why shouldn't I do it this way?
Q2. What is the right way to use the information from the earlier tasks to get better predictions of a participant's likelihood of success?
It's correct in that, as coded, the regression can't do a very good job of separating the two classes, so all my probabilities are low. My thought, though, is that incorporating the info from the earlier tasks may make it easier to separate the classes, so that I can get probabilities that accurately span a wider range.
– M_Wholey Nov 06 '23 at 14:58