Response distribution for GLM when variable is real number between 0 and 1

Question

I am trying to fit a mixed model with a fixed and a random component in Matlab. My response variable is a real number that can only be between 0 and 1. The data is left-skewed, meaning that many observations are very close to 1.

In the Matlab manual it says that I need to specify the approximate distribution of my response variable. I am a little stuck because none of the suggested cases really fit. Two questions here:

How should I go about finding a suitable distribution?
What can happen if I choose the wrong distribution (e.g. biased estimators)?

score 1 · Accepted Answer · answered Nov 17 '20 at 12:17

Some options for models for this type of data include the following:

Use beta regression (the response distribution is a beta distribution, which takes values between 0 and 1).
Transform your data to the real line, e.g. using the logit $x' = \log(x) - \log(1-x)$ as your response variable, then use a standard regression model (predictions and confidence/prediction intervals can be back-transformed using $\exp(x')/(1+\exp(x'))$). In this case you are assuming that the regression residuals, after taking the other model covariates into account, follow a normal distribution.
There are likely other reasonable options. E.g. if you just want to test some hypothesis in a randomized experiment, you could use rank-based non-parametric methods.
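The transformation in the second option can be sketched in a few lines. This is only an illustration in Python rather than Matlab, assuming every response lies strictly between 0 and 1; the function names `logit` and `inv_logit` and the sample values are made up for the example.

```python
import math

def logit(x):
    """Map a value in (0, 1) to the real line: log(x) - log(1 - x)."""
    if not 0.0 < x < 1.0:
        raise ValueError("logit is only defined for 0 < x < 1")
    return math.log(x) - math.log(1.0 - x)

def inv_logit(z):
    """Back-transform from the real line: exp(z) / (1 + exp(z))."""
    # Only non-positive numbers are exponentiated, which avoids overflow.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# Hypothetical left-skewed responses, many close to 1.
y = [0.62, 0.88, 0.93, 0.97, 0.99]
z = [logit(v) for v in y]            # responses for a standard regression model
y_back = [inv_logit(v) for v in z]   # back-transforming recovers the original scale
```

Observations exactly equal to 0 or 1 have no finite logit, so they would need separate handling before this transformation can be applied.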

Which of these makes sense will depend on which models fit the data reasonably well (you can of course make such models more complex, e.g. by introducing random effects if the observed data are overdispersed relative to what your model implies), as well as on what you want to achieve. If you need sensible predictions, it is important to take into account that your data are bounded in (0,1). That said, when modeling binary data (which can only be 0 or 1), some people use the so-called "linear probability model", which models the response probability using linear relationships and allows predicted probabilities <0 or >1. That can work okay for some purposes, but not so well as a prediction model.
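To illustrate why the bounds matter for prediction, here is a small sketch comparing a straight-line fit on the original scale with one on the logit scale. It is in Python rather than Matlab, the data are invented, and `ols` is a hypothetical helper for a single covariate with no random effects, so it is far simpler than the mixed model in the question.

```python
import math

def ols(xs, ys):
    """Ordinary least squares for one covariate; returns (intercept, slope)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

# Hypothetical data: a covariate and a response in (0, 1) that approaches 1.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.60, 0.75, 0.86, 0.93, 0.97]

# Model 1: straight line on the original scale.
a_raw, b_raw = ols(x, y)

# Model 2: straight line on the logit scale, back-transformed for prediction.
z = [math.log(v) - math.log(1.0 - v) for v in y]
a_logit, b_logit = ols(x, z)

x_new = 8.0  # extrapolate beyond the observed covariate range
pred_raw = a_raw + b_raw * x_new
pred_logit = 1.0 / (1.0 + math.exp(-(a_logit + b_logit * x_new)))

print(f"raw-scale prediction:   {pred_raw:.3f}")    # not constrained to (0, 1)
print(f"logit-scale prediction: {pred_logit:.3f}")  # always inside (0, 1)
```

With these made-up numbers the raw-scale line extrapolates above 1, while the back-transformed prediction stays inside the unit interval by construction.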

How much a misspecification of the model matters will depend a lot on the specifics of what you want to use your model for; depending on your setting, you may have more or less tolerance for certain issues with a model.

For example, look at this tweet that models a percentage difference in votes (which can only be between -100% and 100%) using linear regression. Clearly, this model is not perfect: in theory (given different data) it is capable of making absurd predictions, such as one candidate being ahead by 110% once all votes are counted; it ignores the correlation in the observations; and it ignores that the underlying data are a finite number of votes (potentially including votes for a third candidate). However, in this particular case it provided a sensible local linear extrapolation of a trend, which was useful. Or, to quote the much-used saying, "All models are wrong, some models are useful": in this case the model was a bit wrong, but probably quite useful for getting an idea of where things were going.

A good general summary. However, regarding your reference to normality with a link function, I would recommend adding a link to a more detailed discussion of link functions. See, for example, https://stats.stackexchange.com/questions/259683/understand-link-function-in-generalized-linear-model and also https://stats.stackexchange.com/questions/163034/bayesian-logit-model-intuitive-explanation/163039#163039 . — AJKOER, Nov 17 '20 at 13:38
I am revisiting your answer and I was wondering if in your second bullet point it should be division log(x)/(1-log(x)), instead of subtraction log(x)-(1-log(x))? — Maria, Dec 23 '20 at 09:27
@Maria I don't think so. This is the logit transformation, i.e. the logarithm of the odds, which are x/(1-x). Thus, log(odds) = log(x/(1-x)) = log(x) - log(1-x). Also note that log(x)/log(1-x) does not map [0,1] to (-infinity, infinity), but rather to [0, infinity) in an asymmetric fashion. Have a look at: https://www.wolframalpha.com/input/?i=log%28x%29+-+log%281-x%29+on+0+to+1 vs. https://www.wolframalpha.com/input/?i=log%28x%29+%2F+log%281-x%29+on+0+to+1 — Björn, Dec 23 '20 at 09:50
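The identity and the range claim in the comment above can also be checked numerically. A small Python sketch, using an arbitrary grid of x values strictly inside (0, 1):

```python
import math

xs = [i / 100 for i in range(1, 100)]  # grid strictly inside (0, 1)

# log(x / (1 - x)) and log(x) - log(1 - x) agree up to floating-point error.
max_gap = max(
    abs(math.log(x / (1 - x)) - (math.log(x) - math.log(1 - x))) for x in xs
)

# The difference takes both signs; the ratio log(x) / log(1 - x) never goes negative.
diff = [math.log(x) - math.log(1 - x) for x in xs]
ratio = [math.log(x) / math.log(1 - x) for x in xs]

print(max_gap, min(diff), max(diff), min(ratio))
```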
