I am facing the unusual problem that my $p$ values are too good. They are so good that I must be doing something wrong, but I don't know what.
I am working with natural language data from a text corpus. The language I am working with has forms that are, in some contexts, optionally marked for plural number. The equivalent in English would be being able to say both *two trees* and *two tree*. I want to test the hypothesis that marking (*-s* or none) depends on case (nominative [subject], accusative [object], or genitive [possessive]). In particular, I predict that in the nominative case there is relatively more *-s* than in the other cases.
There may be a mediating effect of the type of noun (denoting humans, animals, or things), so that the effect is stronger for some types than for others. There may also be individual differences between nouns (the ratio of marked to unmarked forms may differ for *tree* vs. *goat*, but the prediction would still be that the ratio is higher in the nominative case than in the other cases).
To test this hypothesis I have collected a number of nouns of interest, so I have a table that looks like this:
| Lexeme | Type | Case | Nmarked | Nunmarked |
|---|---|---|---|---|
| tree | other | nominative | 234 | 123 |
| tree | other | genitive | 456 | 567 |
| tree | other | accusative | 678 | 789 |
| goat | animal | nominative | 901 | 12 |
| ... | ... | ... | ... | ... |
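For concreteness, the example rows above correspond to a pandas DataFrame along these lines (the name `df` and the illustrative values are just for exposition; the real table has 90 rows):

```python
import pandas as pd

# Toy version of the table above; the real data frame (here called df) has 90 rows.
df = pd.DataFrame({
    "Lexeme":    ["tree", "tree", "tree", "goat"],
    "Type":      ["other", "other", "other", "animal"],
    "Case":      ["nominative", "genitive", "accusative", "nominative"],
    "Nmarked":   [234, 456, 678, 901],
    "Nunmarked": [123, 567, 789, 12],
})
```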
I never learned any statistics, but from looking around I think this problem calls for a binomial logistic regression. So I have tried to fit such a model on my dataset using Python statsmodels:
import pandas as pd
import statsmodels.api as sm

# endog: counts of (marked, unmarked) forms per row, i.e. the two-column
# success/failure format that statsmodels accepts for a binomial GLM.
# exog: the categorical predictors from the table (df) above.
endog = df[["Nmarked", "Nunmarked"]]
exog = df[["Lexeme", "Type", "Case"]]

exog_dummies = pd.get_dummies(exog)
exog_dummies = sm.add_constant(exog_dummies, prepend=False)
glm_binom = sm.GLM(endog, exog_dummies, family=sm.families.Binomial())
res = glm_binom.fit(use_t=True)
print(res.summary())
I am using get_dummies() to convert the "Lexeme", "Type", and "Case" columns into sets of indicator columns (e.g. "Case_nominative", "Case_genitive", "Case_accusative"). To be honest I am not exactly sure what add_constant() is for, but most online examples include it, and my question stands regardless of it.
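From what I can tell, add_constant() just appends a column of ones that acts as the intercept term; a toy check (columns unrelated to my real data):

```python
import pandas as pd
import statsmodels.api as sm

# add_constant() appends a column of ones named "const"
# (at the end of the frame, because prepend=False).
X = pd.DataFrame({"Case_nominative": [1, 0], "Case_genitive": [0, 1]})
X_const = sm.add_constant(X, prepend=False)
print(X_const.columns.tolist())  # ['Case_nominative', 'Case_genitive', 'const']
```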
If I look at the data myself, it is easy to eyeball that while there is a difference between nominative and genitive, there is very little difference between nominative and accusative. Based on this I would expect that in the model only the variable for the genitive case has a significant effect.
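(By "looking at the data myself" I mean a quick per-case summary along these lines, using the toy `df` sketched above and pooling over lexemes:)

```python
# Overall proportion of marked forms per case, pooled over all lexemes.
per_case = df.groupby("Case")[["Nmarked", "Nunmarked"]].sum()
per_case["prop_marked"] = per_case["Nmarked"] / (per_case["Nmarked"] + per_case["Nunmarked"])
print(per_case)
```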
What I find instead is that the variables for all cases have a significant effect, with a p-value of 0.000:
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Case_nominative | -0.0962 | 0.012 | -8.330 | 0.000 | -0.119 | -0.073 |
| Case_genitive | -0.4009 | 0.015 | -25.916 | 0.000 | -0.432 | -0.370 |
| Case_accusative | -0.1058 | 0.021 | -5.138 | 0.000 | -0.147 | -0.065 |
(My program also reports a Cox-Snell (CS) pseudo $R^2$ of $1$, and a Pearson $\chi^2$ of $488$. My full dataset has $18$ lexemes with $5$ cases each, i.e. $90$ observations/rows in the table, with a total of $335{,}464$ word forms from the corpus.)
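(For completeness, these diagnostics can also be read off the fitted results object; the attribute and method names below are from the statsmodels GLM results API, and `pseudo_rsquared` may only be available in newer versions:)

```python
print(res.pearson_chi2)   # Pearson chi-squared of the fitted model
print(res.df_resid)       # residual degrees of freedom
# Newer statsmodels versions expose the Cox-Snell pseudo R^2 directly:
print(res.pseudo_rsquared(kind="cs"))
```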
I do see that the coefficient for the genitive is much larger in magnitude. However, I find it dodgy that the p-values are so extremely low, and I am not sure how I should report these results when discussing the hypothesis that there is a significant difference in marking between cases.
If anyone can point out where I'm going wrong, that would be much appreciated.
`results.t_test("Case_genitive = Case_accusative")`. `t_test` is for a single hypothesis, `wald_test` is for a joint hypothesis, e.g. `results.wald_test("Case_genitive - Case_accusative, Case_genitive - Case_nominative")`. (Aside: even though it's called `t_test`, it uses the normal asymptotic distribution by default in GLM.) – Josef Feb 24 '24 at 03:08
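For reference, on the fitted `res` above, those suggestions would look roughly like this (the constraint strings refer to the dummy column names created by get_dummies; this is a sketch of the suggested usage, not output I have run):

```python
# Single hypothesis: genitive and accusative coefficients are equal.
print(res.t_test("Case_genitive = Case_accusative"))

# Joint hypothesis: all three case coefficients are equal.
print(res.wald_test("Case_genitive - Case_accusative, Case_genitive - Case_nominative"))
```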