5

I have some data where each row has features and an output variable on the interval [0,1]. The output was likely produced by a logistic regression, but we have no access to the underlying event data and want to recreate a logistic regression model without binary events. In other words, we aim to recreate the model using only the outputs we have (i.e. probabilities). What is the best way to do this:

a) Perform a logistic regression with the data as is.
b) Convert the probabilities to event data, coded as {0,1}.
c) Multinomial regression.

Any thoughts or nuances would be most welcome.

1 Answer

6

Logistic regression works perfectly well with probabilities as the target variable, so that is the way to go. You can also use beta regression. Note, however, that if the target variable consists of predictions from another model, training on it means your model will learn to imitate that other model, which is not the same as training on the raw data.
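To illustrate why probabilities work as targets, here is a minimal sketch in plain Python. It assumes a synthetic data set whose outputs come from a logistic model with made-up coefficients (`true_w` is hypothetical, standing in for the unknown original model), and it fits by simple gradient descent on the cross-entropy loss. The loss and its gradient have the same form whether the target is 0/1 or a fraction in between, so nothing special is needed for fractional targets. A real analysis would more likely use a statistics library than this hand-rolled fit.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_soft_targets(X, y, lr=1.0, n_iter=2000):
    """Fit logistic regression by gradient descent on cross-entropy.

    X: list of feature rows (no intercept column); y: targets in [0, 1].
    Returns [intercept, coef_1, ..., coef_k].
    """
    n, k = len(X), len(X[0])
    w = [0.0] * (k + 1)
    for _ in range(n_iter):
        grad = [0.0] * (k + 1)
        for row, target in zip(X, y):
            p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], row)))
            err = p - target  # same gradient form as with 0/1 labels
            grad[0] += err
            for j, xj in enumerate(row, start=1):
                grad[j] += err * xj
        # Step along the negative mean gradient.
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


# Synthetic stand-in for the unknown original model (assumed coefficients).
rng = random.Random(0)
true_w = [0.5, 1.5, -1.0]
X = [[rng.uniform(-2, 2), rng.uniform(-2, 2)] for _ in range(200)]
y = [sigmoid(true_w[0] + true_w[1] * a + true_w[2] * b) for a, b in X]

fitted_w = fit_logistic_soft_targets(X, y)
print("true coefficients:  ", true_w)
print("fitted coefficients:", [round(v, 3) for v in fitted_w])
```

Because the targets here are noise-free outputs of a logistic model, the fit should land very close to the coefficients that generated them. That is the "imitating the other model" behaviour described above: the fit reproduces the source model, not whatever raw events that model was originally trained on.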

Converting the probabilities to binary classes would lead to a loss of information. Moreover, to do this you would need to decide on a cutoff threshold and this gives you no guarantees whatsoever that the binary classes would be similar to the raw labels.
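A small simulation can make the cutoff problem concrete. The sketch below is built entirely on assumptions: the probabilities are drawn uniformly at random as a stand-in for model outputs, and the "raw" events are simulated as Bernoulli draws from those probabilities, since the real events are unknown. It then thresholds the probabilities at a few arbitrary cutoffs and counts how often the resulting labels disagree with the simulated events.

```python
import random

rng = random.Random(0)

# Stand-in model outputs and hypothetical events drawn from them.
probs = [rng.random() for _ in range(1000)]
raw_labels = [1 if rng.random() < p else 0 for p in probs]


def binarize(ps, cutoff):
    """Code each probability as 1 if it reaches the cutoff, else 0."""
    return [1 if p >= cutoff else 0 for p in ps]


mismatch_by_cutoff = {}
for cutoff in (0.3, 0.5, 0.7):
    labels = binarize(probs, cutoff)
    mismatch = sum(a != b for a, b in zip(labels, raw_labels)) / len(probs)
    mismatch_by_cutoff[cutoff] = mismatch
    print(f"cutoff={cutoff}: {mismatch:.1%} of labels differ from the simulated events")
```

In this setup a sizeable share of the thresholded labels differs from the simulated events at every cutoff, and the coded labels themselves change with the cutoff, while the distinction between, say, 0.51 and 0.99 is discarded entirely.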

I have no idea how you could turn this into a multinomial regression problem, so I can't comment on that, but the point above about losing information likely applies there as well.

Tim