2

I'm trying to predict the probability of an individual progressing all the way through a series of tasks. Participants start with Task 1; if they succeed they move on to Task 2, if they succeed there they move on to Task 3, and if they succeed at Task 3, yay! I want to predict the probability of successfully completing all the tasks.

Right now I'm running a binary logistic regression where the dependent variable is T/F for whether a participant succeeded on Task 3. But that's a very rare outcome, so all of my predicted probabilities are very low. I feel like I'm leaving a lot of information on the table with regard to how people did on Task 1 and Task 2.

I was thinking I could modify the DV so that it's 0 if an individual completed no tasks, 0.33 if they only completed Task 1, 0.66 if they completed Task 2, and 1 if they completed all 3. This makes sense to me because I'm going from a case where an event either happened or didn't to a case where I can incorporate information about how close someone got to the critical event. But I can't seem to find anyone else doing it this way, so I assume there's something very wrong with it.

Q1. Why shouldn't I do it this way?

Q2. What is the right way to use the information from the earlier tasks to get better predictions of the participants' likelihood of success?

  • 3
    Why not code it 0, 1, 2 and 3 and then do an ordinal regression? Does that not answer your scientific question? – mdewey Nov 06 '23 at 14:47
  • 3
    Welcome to Cross Validated! Could you please elaborate on why you find it problematic that your predicted probabilities are low? If the events being predicted to happen with low probability really do not happen often, that seems like correct performance. – Dave Nov 06 '23 at 14:49
  • 4
    If you have a rare outcome, then your predicted probabilities should be low, no? – Stephan Kolassa Nov 06 '23 at 14:56
  • @mdewey my understanding is that ordinal regression gives probabilities of going 0 -> 1, 1 -> 2 and so on but I want the probability of going all the way through. Maybe a more clarifying question is, is there a substantial difference between coding the logistic regression in the way I've specified and the ordinal regression you propose that makes the ordinal regression better? – M_Wholey Nov 06 '23 at 14:57
  • @Dave It's correct in that, as coded, the regression can't do a very good job of separating the two classes and so all my probabilities are low. My thought though is that incorporating the info from the earlier tasks may make it easier to separate the classes so that I can get probabilities which accurately span a wider range. – M_Wholey Nov 06 '23 at 14:58
  • You might be able to improve your predictions for the final endpoint by leveraging whether someone has progressed successfully to the intermediate stages - but then these are predictors, not outcomes. You would essentially be modeling the probability for someone to complete Task 3, conditional on whether they have successfully done Tasks 1 and 2. That you can of course do. But your question reads as if you wanted to model all stages as outcomes. Which you can of course also do, but it will not change your prediction for the final result very much. – Stephan Kolassa Nov 06 '23 at 15:03
  • 3
    Phrased differently, do you want to model the probability of completing stages one through three, or do you want to model the probability of completing stage three given that the subject has completed an earlier stage? Both are reasonable to model but address different questions. – Dave Nov 06 '23 at 15:06
  • 1
    You might be asking too much of a model in this situation. How many individuals in your data set completed Task 3? How many predictors are you trying to include in the model? It would help to include that information, and critical parts of your comments, in an edit of your question; comments are easy to overlook and can be deleted. – EdM Nov 06 '23 at 15:32
  • 1
    @StephanKolassa I want the probability of Task 3 completion not conditioned on earlier tasks. I actually need to have my prediction in hand before any individual even touches Task 1, so a conditional probability is of no use. Your last sentence is maybe where the divide is. The way I'm thinking is: I have all this info about progression but I'm reducing it to one rare binary event. So if I just don't reduce it, I can give the model more info about task progression and it'll give me better predictions back. But you say it won't give much better predictions. I guess, clarifying this down, my question is: why? – M_Wholey Nov 06 '23 at 15:41
  • 1
    I think the better question should be the other way around: why do you believe that modeling intermediate steps should improve the predictions for the final step? If there is something in your data that improves the prediction for stage 2, which in turn allows you to improve the prediction for stage 3, then your "direct" model should already include and leverage that "something". – Stephan Kolassa Nov 06 '23 at 15:43
  • 1
    I wonder if modeling the intermediate steps would degrade performance in predicting the event of interest by sucking away resources (data) to predict something you don’t care to predict. – Dave Nov 06 '23 at 15:44
  • What are the variables you include as explanatory regressors? – Alecos Papadopoulos Nov 06 '23 at 16:03
  • @StephanKolassa with a logistic regression for every model and without clever feature transformations, I suspect it is possible that the staged predictions could outperform, just by having some nonlinearity in the aggregate? – Ben Reiniger Nov 06 '23 at 16:18
  • 1
    @BenReiniger But then why not just include that nonlinearity in a logistic regression on the stage-three outcomes? – Dave Nov 06 '23 at 16:20
  • @Dave That's why I said "without clever feature transformations"; it might not be obvious what the right transformation is, but just doing the staged predictions recovers something good enough. (Of course, a more flexible model might work just as well and still not require much effort. And if you can find a good transformation, sure, that's even better.) – Ben Reiniger Nov 06 '23 at 16:25
  • 1
    I suppose there's one more argument in favor of staged predictions: while you might not care about the intermediate predictions now, it could be useful information to monitor: "the model's not doing well; oh, it's specifically because people who complete stage 1 aren't completing stage 2 as predicted by that submodel; let's investigate that." – Ben Reiniger Nov 06 '23 at 16:26
  • Re your last comment: but now you are looking at whether someone is completing stage 2 conditional on completing stage 1, which is what you do not want to model... – Stephan Kolassa Nov 06 '23 at 16:56
  • 1
    Have you considered the sequential logit model? Example here. – dimitriy Nov 06 '23 at 19:09
  • @dimitriy could you turn it into an answer? I think it answers the reduction of data issue raised – seanv507 Nov 06 '23 at 20:06
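
For readers following the comments, here is a minimal sketch of the staged (sequential) idea: the unconditional probability of finishing everything is the product of the stage-wise conditional pass rates, P(T1) × P(T2 | T1) × P(T3 | T2). The data, the group labels `a` and `b`, and the function names are all made up for illustration; this is not OP's data or a fitted sequential logit model, just the covariate-free arithmetic behind it.

```python
# Hypothetical records: (group, furthest task completed), with 0 meaning no task passed.
records = (
    [("a", 0)] * 40 + [("a", 1)] * 30 + [("a", 2)] * 20 + [("a", 3)] * 10
    + [("b", 0)] * 70 + [("b", 1)] * 20 + [("b", 2)] * 8 + [("b", 3)] * 2
)


def staged_probability(furthest, n_tasks=3):
    """Multiply the conditional pass rates P(pass task k | reached task k)."""
    prob = 1.0
    for k in range(1, n_tasks + 1):
        reached = sum(1 for f in furthest if f >= k - 1)  # attempted task k
        passed = sum(1 for f in furthest if f >= k)       # passed task k
        if reached == 0:
            return 0.0
        prob *= passed / reached
    return prob


def direct_probability(furthest, n_tasks=3):
    """Share of participants who completed every task (the binary outcome)."""
    return sum(1 for f in furthest if f == n_tasks) / len(furthest)


# Split the records by the (hypothetical) grouping variable.
by_group = {}
for group, furthest_task in records:
    by_group.setdefault(group, []).append(furthest_task)

results = {
    g: (staged_probability(fs), direct_probability(fs))
    for g, fs in by_group.items()
}

for g, (staged, direct) in sorted(results.items()):
    print(f"group {g}: staged={staged:.4f} direct={direct:.4f}")
```

With one saturated rate per group, the staged product and the direct completion rate coincide, which is in line with the comment that a direct model can already use whatever the staged models use. The two approaches can differ once each stage is a constrained model (say, a logistic regression that is linear in the predictors), which is the nonlinearity point raised above.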

1 Answer

3

I'm trying to predict the probability of an individual progressing all the way through a series of tasks.

Logistic regression seems like an ideal model, but I would code the response as 1 if the participant completed all tasks and 0 if they didn't. A 0 may mean they did 3 but not the 4th and final task, or that they didn't do anything at all; a 1 may mean they only just finished everything in the allotted time, or that they had considerable time to spare.

You can certainly fit a logistic model to non-integer responses on the unit scale, such as 0.33 or 0.66 (see other posts on this topic). But who are you to say how hard each task was and assign a number correspondingly? Consider that the 4th and final task may be impossibly hard, so that no one in the sample completes it, while tasks 1-3 are trivial. Every participant then gets a response of 0.75, so each is predicted to have a 75% chance of completing all tasks even though nobody actually did. That's why you shouldn't fit the logistic model with fractional responses.
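
A rough sketch of that failure case, under assumed numbers: 200 hypothetical participants, 4 tasks, everyone passes tasks 1-3 and fails task 4. The helper `fit_intercept_only_logit` is written here for illustration (plain Newton-Raphson on an intercept-only logistic likelihood), not taken from any library.

```python
import math


def fit_intercept_only_logit(y, iters=50):
    """Newton-Raphson for an intercept-only logistic model with responses in [0, 1].

    Returns the fitted probability, which for this model is the mean of y.
    """
    b = 0.0
    n = len(y)
    for _ in range(iters):
        p = 1.0 / (1.0 + math.exp(-b))
        score = sum(y) - n * p            # gradient of the log-likelihood
        information = n * p * (1.0 - p)   # negative second derivative
        b += score / information
    return 1.0 / (1.0 + math.exp(-b))


n_tasks = 4
completed = [3] * 200  # hypothetical: everybody passes tasks 1-3, nobody passes task 4

# Fractional coding: share of tasks completed, so everyone is coded 0.75.
fractional_y = [c / n_tasks for c in completed]
# Binary coding: 1 only if all tasks were completed.
binary_y = [1.0 if c == n_tasks else 0.0 for c in completed]

fractional_pred = fit_intercept_only_logit(fractional_y)
observed_rate = sum(binary_y) / len(binary_y)

print(f"fractional-response prediction: {fractional_pred:.3f}")
print(f"observed completion rate:       {observed_rate:.3f}")
```

The fractional fit reproduces the average code (0.75), which is a statement about how far people got on an arbitrary scale, while the observed rate of completing everything is 0. The binary model is not fitted here because its maximum likelihood estimate does not exist when every response is 0.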

AdamO
  • 1
    I don't quite think general IRT is the way to go here, since the tasks are ordered: as far as I understand, you only get to task 2 after you have successfully completed task 1, and so forth. That would to me very much suggest ordinal logistic regression. – Stephan Kolassa Nov 06 '23 at 16:55
  • @StephanKolassa Of course we know very little about OP's precise problem. Recall these cumulative link models treat the intercept as a nuisance parameter, so without any fixed covariates, it amounts to predicting the empirical proportions for each task's completion rate - which is trivial - and you can't make very good inference from those parameters such as a CI to summarize Pr(4 or more tasks). – AdamO Nov 06 '23 at 17:25
  • @StephanKolassa But I see your issue now. You're correct, IRT requires conditional independence of the tasks, but not completing task 2 would completely determine probability of completing tasks 3 and 4. Anyway, we might ask why OP can't simply cross-tabulate results and call it done. – AdamO Nov 06 '23 at 17:27
  • @StephanKolassa deleted the last para RE: IRT per your comment. I have already stated how I feel a valid logistic model would be fit. – AdamO Nov 06 '23 at 17:28
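
To make the cumulative link discussion in these comments concrete, here is a small sketch assuming the common parameterization P(Y ≤ k) = sigmoid(θ_k − η), where the θ_k are cutpoints and η is the linear predictor. The counts, the cutpoints derived from them, and the shift η = 1.0 are hypothetical; nothing here is fitted to OP's data. It uses OP's 3-task setup, so Y runs from 0 to 3.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def category_probs(cutpoints, eta=0.0):
    """Category probabilities from a cumulative logit model.

    Assumes P(Y <= k) = sigmoid(cutpoints[k] - eta) for every k below the top category.
    """
    cumulative = [sigmoid(c - eta) for c in cutpoints] + [1.0]
    return [cumulative[0]] + [
        cumulative[k] - cumulative[k - 1] for k in range(1, len(cumulative))
    ]


# Hypothetical counts of the furthest task completed: 0, 1, 2, 3.
counts = [40, 30, 20, 10]
n = sum(counts)

# Without covariates, the cutpoints are the logits of the empirical cumulative proportions.
cum_props = [sum(counts[: k + 1]) / n for k in range(len(counts) - 1)]
cutpoints = [math.log(p / (1.0 - p)) for p in cum_props]

probs_no_covariates = category_probs(cutpoints)
p_all = probs_no_covariates[-1]                        # P(completed all tasks)
p_all_shifted = category_probs(cutpoints, eta=1.0)[-1]  # same, for a larger linear predictor

print([round(p, 3) for p in probs_no_covariates])
print(round(p_all, 3), round(p_all_shifted, 3))
```

Two things are visible in this sketch. With no covariates the model only reproduces the empirical proportions, as noted in the comment above. And the top-category probability, 1 − P(Y ≤ 2), is exactly the unconditional probability of completing every task, so an ordinal model does provide the quantity OP asked for, not only step-to-step transitions.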