I have a regression problem where the outcomes are not strictly 0 or 1 but rather continuous values covering the whole interval from 0 to 1 inclusive, e.g. $Y = [0, 0.12, 0.31, \ldots, 1]$.
This problem has already been discussed in this thread, although my question is slightly different.
I can't use linear regression, for the same reasons that logistic regression is normally preferred: A) very large IV values will push the predicted outcome past 1, and B) the predictions of linear regression are not bounded to the $[0,1]$ interval.
Looking at this logistic cost function from my textbook $$\text{Cost} = -y \log(h(x)) - (1 - y) \log(1-h(x))$$ I gather that the equation is designed to produce a cost greater than 0 only when $y$ and $h(x)$ do not take the same value, 0 or 1.
Would it be possible to use logistic regression by modifying the cost function so that it measures the hypothesis error for all fractional values of $y$, not just 0 and 1?
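In fact the cross-entropy cost above is already well defined for any $y \in [0,1]$, not just the endpoints, so one can minimize it directly with fractional responses. Below is a minimal sketch in Python (the variable names and the simulated data are my own, purely for illustration): it fits a logistic curve to outcomes in $[0,1]$ by numerically minimizing the mean cross-entropy.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one IV
true_beta = np.array([-0.5, 2.0])
# Fractional outcomes in [0, 1]: the conditional mean follows a logistic curve
y = 1 / (1 + np.exp(-(X @ true_beta + rng.normal(scale=0.5, size=n))))

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cross_entropy(beta, X, y, eps=1e-12):
    p = np.clip(sigmoid(X @ beta), eps, 1 - eps)
    # The same cost as for 0/1 labels; nothing requires y to be binary
    return np.mean(-y * np.log(p) - (1 - y) * np.log(1 - p))

fit = minimize(cross_entropy, x0=np.zeros(2), args=(X, y), method="BFGS")
print(fit.x)  # slope estimate should be positive, near the true value
```

Because the cost is not a true log-likelihood for fractional data, the point estimates are sensible (this is a quasi-likelihood) but the usual likelihood-based standard errors are not directly valid.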
Is this what the `glm()` function in R is doing when it is fed a continuous response and `family=quasibinomial`? I.e. will it estimate the coefficients with `family=binomial` and then, in an extra step, compute standard errors taking over-dispersion into account? If yes, is this the same as computing "robust standard errors"? I have some appropriate data and I tried both families with `glm`; I get identical coefficients but differing standard errors. Thanks. – amoeba Sep 05 '16 at 19:28