How to Handle 0 and 1 in Logit Transformation?

Question

I am planning to analyze experimental data with repeated measurements using GEE (Generalized Estimating Equations) or RM ANOVA. Some of the measurements are proportions (the fraction of individuals performing a specific behavior in each replicate). Previously, I used the arcsine square root transformation, but some papers argue that this method is not appropriate and that the logit transformation is more defensible. I would therefore like to analyze the data after a logit transformation, but the data contain both 0 and 1. These values map to negative and positive infinity, respectively, which makes subsequent statistical analysis impossible. Is there a way to apply the logit transformation in this case? Or should I abandon this method and stick with the arcsine square root transformation?

Why do you want to transform your outcome, rather than use a GLM with an appropriate link for the linear predictor? For example, a binomial GLM would not be able to predict a probability of exactly 0 or 1 for the reasons you mention, but that's rarely an issue in practice, and you can use your data as-is. I'm also not sure why you would treat such data as a proportion, thereby discarding the size of the denominator? This sounds like a classical example of a binomial experiment. — PBulls, Mar 13 '24 at 09:18
It would help us to know your response variable, your explanatory variables, and what you are trying to test. You say you have repeated measures: what is repeated? — CaroZ, Mar 13 '24 at 11:07
Not quite understanding why you want Logit. For example, if the model is linear after log-log transformation where the original equation becomes $Y\approx P_0\,X_1^{P_1}X_2^{P_2}\cdots X_n^{P_n}(S=1,2)^{P_s}$ then you can encode a binary variable for example, "For these fits, values of S equal to 1 for female and 2 for male were assigned so that lnS exists and so that regressions using ln S would find appropriate scaling exponents." DOI: 10.1097/01.mnm.0000237988.52572.2c — Carl, Mar 14 '24 at 01:52
Here is a more specific outline of the experimental procedure:
I observed the behavior of the target organisms and recorded it. There were 8 individuals in each jar, and I calculated the ratio of individuals showing the behavior by dividing by 8. Since there were a total of 12 jars, n=12. This behavior was observed twice a day for 14 days, resulting in 28 measurements of behavior. I compare the ratio of behavior occurrence between the control and experimental groups. — soobinism, Mar 14 '24 at 05:42
Based on the responses I've read, it seems that the arcsine square root transformation and the logit transformation may not be necessary. I have previously used the arcsine square root transformation to achieve normality in other experiments. So, if I use a generalized model like GEE, would data transformation be unnecessary? — soobinism, Mar 14 '24 at 05:42
I do not think the best approach is to transform your data. I would simply use a Generalized Linear Mixed Model for a binomial distribution. — CaroZ, Mar 13 '24 at 11:11

Nicholas Clark · Answer 1 · 2024-03-13T23:08:53.810

I agree with CaroZ. If you have the number of individuals that completed a task, and you have the total number of individuals that tried the task in each replicate, you can easily use a Binomial observation model in a Generalized Linear Modeling framework. This will allow you to ask targeted questions while accounting for repeated measures and other confounders. This lecture by Richard McElreath provides a very useful introduction (in fact, the entire series is well worth your time).
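As a rough illustration of the binomial approach, here is a minimal sketch on simulated data. The data frame `d`, its column names, and all simulated values are made up to mimic the design described in the comments (12 jars of 8 individuals, 28 observation times); the even split of jars between groups is an assumption. The mixed-model step assumes the `lme4` package and is skipped if it is not installed.

```r
# Simulated stand-in for the jar experiment (all values are made up):
# 12 jars of 8 individuals, 6 jars per group, 28 observation times each.
set.seed(1)
d <- expand.grid(time = 1:28, jar = 1:12)
d$group <- factor(ifelse(d$jar <= 6, "control", "treatment"))
d$n <- 8                                     # individuals per jar (denominator)
jar_eff <- rnorm(12, sd = 0.5)               # jar-to-jar variation on the logit scale
base_p <- ifelse(d$group == "control", 0.10, 0.90)
p <- plogis(qlogis(base_p) + jar_eff[d$jar])
d$y <- rbinom(nrow(d), size = d$n, prob = p) # number showing the behaviour

# Counts of 0 and 8 (proportions 0 and 1) need no special handling here.
table(d$y)

# Binomial GLM on successes and failures. The logit link is applied to the
# expected probability, not to the observed proportions.
fit <- glm(cbind(y, n - y) ~ group, family = binomial, data = d)
summary(fit)$coefficients

# Mixed-model version with a random intercept per jar for the repeated
# measures, run only if lme4 is installed.
if (requireNamespace("lme4", quietly = TRUE)) {
  fit_mm <- lme4::glmer(cbind(y, n - y) ~ group + (1 | jar),
                        family = binomial, data = d)
  print(summary(fit_mm)$coefficients)
}
```

The plain `glm` fit ignores the repeated measurements within a jar, so its standard errors are only a starting point; the random intercept in the `glmer` call is one way to account for them.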

If you stick with proportions and use a Beta regression, you will not only have difficulties with the 1s and 0s, but you will also lose the uncertainty associated with the number of individuals that tried the task. For example, if 1 / 4 successfully completed the task, your proportion of 0.25 should be considered more uncertain than if 4 / 12 were successful. Treating your data as proportions and either using a Beta regression or some other form of regression will provide less useful inferences because your estimates won't consider these underlying uncertainties.
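The point about denominators can be checked with exact binomial confidence intervals from base R. This is only a sketch: 3 / 12 is used here in place of the 4 / 12 in the text so that the observed proportion is exactly 0.25 in both cases and only the denominator changes.

```r
# Same observed proportion (0.25) from two different denominators.
ci_small <- binom.test(x = 1, n = 4)$conf.int    # 1 success out of 4
ci_large <- binom.test(x = 3, n = 12)$conf.int   # 3 successes out of 12

# Width of each 95% interval; the smaller denominator gives the wider one.
width_small <- diff(as.numeric(ci_small))
width_large <- diff(as.numeric(ci_large))
c(n4 = width_small, n12 = width_large)
```

A model fitted to the bare proportion 0.25 cannot tell these two cases apart, whereas a binomial model sees the counts and weights them accordingly.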

What is above vs. below changes over time. Consider referring more specifically. — Richard Hardy, Mar 13 '24 at 17:25

score 2 · Answer 2 · answered Mar 13 '24 at 16:13

If your dependent variable is a proportion, logit isn't the right model. Logit (and probit, among others) are meant for binary (only 0 or 1) response variables.

The model you might want to consider is beta regression. However, since your data also contains exact 0s or 1s, you could either transform the data as described in the answers here, or directly use a more general version of beta regression, known as zero/one inflated beta regression.
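The linked answers are not reproduced here, so which transformation they recommend is not known. One commonly cited option is the compression proposed by Smithson and Verkuilen (2006), which maps y to (y(n - 1) + 0.5) / n, where n is the number of observations. The sketch below applies it in base R; the example values and the helper name `squeeze` are made up for illustration.

```r
# Proportions out of 8, including the boundary values 0 and 1.
y <- c(0, 1, 3, 8, 8, 5, 0, 2) / 8

qlogis(y)            # plain logit: -Inf for 0 and Inf for 1

# Compress [0, 1] slightly towards 0.5; n is the number of observations.
squeeze <- function(y, n = length(y)) (y * (n - 1) + 0.5) / n

y_sq <- squeeze(y)
range(y_sq)          # strictly inside (0, 1)
qlogis(y_sq)         # all finite
```

The amount of compression depends on n, so results can be sensitive to this somewhat arbitrary choice; that is one reason the comments favour modelling the counts directly.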

Durden


This is way out of date. Logit link models for continuous proportion responses have been in the literature for at least 50 years and are a natural extension of the classic approach for binary responses. Indeed logistic curves were used in C19 long before Berkson and others started their work in the 1940s. – Nick Cox Mar 14 '24 at 07:51
"This" being the first paragraph. Beta regression is what it is and if it works well, fine all round. – Nick Cox Mar 14 '24 at 12:04
Yes, there is such a thing as fractional logit and probit, but beta regression at least in my opinion is the more natural approach. – Durden Mar 14 '24 at 17:20
If I say it's the other way round for me, that shows that we can't get far on this. But I am always happy at the thought that people could and should choose a model for their data and their goals, and so forth. – Nick Cox Mar 14 '24 at 18:31
I'm not saying fractional logit would be a terrible choice. I just didn't want my above answer to be a catalogue of models to choose from. Instead, I decided to just recommend the one I consider most appropriate. – Durden Mar 14 '24 at 19:43

score 2 · Answer 3 · answered Mar 13 '24 at 16:50

You don't need to transform your data or use a generalized linear mixed model. You can use a linear model (yes, even with bounded outcomes) and adjust your standard errors to account for clustering in the experiment (i.e., by participant). A cluster-robust standard error also adjusts for heteroscedasticity and the fact that the outcome variable is not normally distributed within groups. This can be done using GEE as you have suggested without any modification to your outcome variable. In addition, the coefficients in your model are directly interpretable as differences in means on the scale of your outcome variable.

Note that this approach can be problematic when the experiment is imbalanced, i.e., there are different numbers of observations in each cell of the design, or when you are additionally adjusting for a continuous covariate. But for most experimental designs where repeated measures ANOVA would be appropriate, this approach would be, too.

Here is how you would do this in R, requesting a cluster-robust standard error using sandwich:

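# Ordinary linear model on the untransformed outcome
fit <- lm(outcome ~ factor1 * factor2 * factor3, data = data)
# Coefficient tests with standard errors clustered by participant
lmtest::coeftest(fit, vcov = sandwich::vcovCL, cluster = ~participantID)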

You can use functionality in marginaleffects to probe the conditions (see my answer here for an example).
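The GEE route mentioned above could look roughly like the following sketch. It uses simulated data, and the names `d`, `prop`, `group`, and `jar` are hypothetical. Treating the jar as the cluster is an assumption based on the design described in the question's comments, where each jar is measured repeatedly. The GEE step assumes the `geepack` package and is skipped if it is not installed.

```r
# Simulated stand-in data (made up): 12 jars, 28 time points, proportions out of 8.
set.seed(2)
d <- expand.grid(time = 1:28, jar = 1:12)    # rows are ordered by jar
d$group <- factor(ifelse(d$jar <= 6, "control", "treatment"))
d$prop <- rbinom(nrow(d), 8, ifelse(d$group == "control", 0.2, 0.6)) / 8

# Linear model on the untransformed proportions: with a single two-level
# factor, the slope is the difference between the two group means.
fit_lm <- lm(prop ~ group, data = d)
coef(fit_lm)

# GEE with an identity link and jar as the cluster; its default standard
# errors are the robust (sandwich) ones. Run only if geepack is installed.
if (requireNamespace("geepack", quietly = TRUE)) {
  fit_gee <- geepack::geeglm(prop ~ group, id = jar, data = d,
                             family = gaussian, corstr = "independence")
  print(summary(fit_gee)$coefficients)
}
```

`geeglm` expects the rows of each cluster to be contiguous, which is why the data are built in jar order. Other working correlation structures, such as `"exchangeable"`, can be requested through `corstr`.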

Noah


You can do this; it often works moderately well. As you realise, and as all readers should note, in principle such a model could predict outcomes outside $[0, 1]$. I'd say that logit models can do good approximations to linear when linear works well, but linear can't do good approximations to logit when the latter is really needed. – Nick Cox Mar 14 '24 at 12:03
@NickCox because OP mentioned it's an experiment, that suggests they can fit a saturated linear regression model, in which case there is no risk of out of bounds predictions or functional form misspecifications. The only problem is the error distribution, which can be obviated by using the robust standard error that comes from GEE or a cluster-robust SE. In other modeling contexts I agree a linear model can be problematic for bounded outcomes. – Noah Mar 14 '24 at 14:36
As far as I am concerned, being experimental or not is immaterial to the point I am making. I am not trying to be very specific about the OP's set-up. I'd rather that the variance structure was modelled directly than rely on robust SEs as a get-out-of-jail card. – Nick Cox Mar 14 '24 at 16:42
@NickCox The robust SEs for both a linear and logistic model are a function of the design matrix and residuals, which will be identical between the two model types if the model is saturated. That's why it matters whether we're in an experiment or not. I agree that in other cases it does matter how the model is structured. But we can take advantage of the unique design of this study to simplify inference at no cost. – Noah Mar 14 '24 at 17:42
I think you're exaggerating the similarity of the models by implication. I care most about getting the functional form of the model about right. Let me guess: you're an economist? – Nick Cox Mar 14 '24 at 18:29
Thank you for your deep discussion. I would like to analyze whether there was a difference in specific behavior between experimental and control groups when comparing them based on a single factor. This experiment was designed as a repeated-measures experiment, where the behavior was measured twice a day for 14 days, totaling 28 measurements. More detailed experimental conditions are in the comments on the question. – soobinism Mar 15 '24 at 04:10
The reason I initially considered using the arcsine square root transformation is that the data are proportions. I learned that it is necessary to perform the arcsine square root transformation to ensure normality for proportional data before conducting RM ANOVA. So, if I am going to use GEE, is it acceptable to use proportion data (no transformation of data)? Or, when I want to perform RM ANOVA, are there good data transformation methods to meet the assumptions of normality and other conditions? – soobinism Mar 15 '24 at 04:10
Normality is not an assumption of ANOVA; it is an assumption that is required for the usual standard errors to be valid. But I am recommending you use robust standard errors, which come with GEE, or can be implemented using the code I provided in my answer. – Noah Mar 15 '24 at 04:51
