I am facing the unusual problem that my $p$ values are too good. They are so good that I must be doing something wrong, but I don't know what.
I am working with natural language data from a text corpus. The language I am working with has forms that are, in some contexts, optionally marked for plural number. The equivalent in English would be being able to say both *two trees* and *two tree*. I want to test the hypothesis that marking (*-s* or none) depends on case (nominative [subject], accusative [object], or genitive [possessive]). In particular, I predict that in the nominative case there is relatively more *-s* than in the other cases.
There may be a mediating effect of the type of noun (denoting humans, animals, or things), so that the effect is stronger for some types than for others. There may also be individual differences between nouns (the ratio of marked to unmarked forms may differ for *tree* vs. *goat*, but the prediction would still be that the ratio is higher in the nominative case than in the other cases).
To test this hypothesis I have collected a number of nouns of interest, so I have a table that looks like this:
| Lexeme | Type | Case | Nmarked | Nunmarked |
|---|---|---|---|---|
| tree | other | nominative | 234 | 123 |
| tree | other | genitive | 456 | 567 |
| tree | other | accusative | 678 | 789 |
| goat | animal | nominative | 901 | 12 |
| ... | ... | ... | ... | ... |
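For concreteness, the example rows above correspond to a pandas DataFrame along these lines (the name `df` and the illustrative values are just for exposition; the real table has 90 rows):

```python
import pandas as pd

# Toy version of the table above; the real data frame (here called df) has 90 rows.
df = pd.DataFrame({
    "Lexeme":    ["tree", "tree", "tree", "goat"],
    "Type":      ["other", "other", "other", "animal"],
    "Case":      ["nominative", "genitive", "accusative", "nominative"],
    "Nmarked":   [234, 456, 678, 901],
    "Nunmarked": [123, 567, 789, 12],
})
```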
I never learned any statistics, but from looking around I think this problem calls for a binomial logistic regression. So I have tried to fit such a model on my dataset using Python statsmodels:
import pandas as pd
import statsmodels.api as sm

# endog: counts of (marked, unmarked) forms per row, i.e. the two-column
# success/failure format that statsmodels accepts for a binomial GLM.
# exog: the categorical predictors from the table (df) above.
endog = df[["Nmarked", "Nunmarked"]]
exog = df[["Lexeme", "Type", "Case"]]

exog_dummies = pd.get_dummies(exog)
exog_dummies = sm.add_constant(exog_dummies, prepend=False)
glm_binom = sm.GLM(endog, exog_dummies, family=sm.families.Binomial())
res = glm_binom.fit(use_t=True)
print(res.summary())
I am using get_dummies() to convert the "Lexeme", "Type", and "Case" columns into sets of indicator columns (e.g. "Case_nominative", "Case_genitive", "Case_accusative"). To be honest I am not exactly sure what add_constant() is for, but most online examples include it, and my question stands regardless of it.
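From what I can tell, add_constant() just appends a column of ones that acts as the intercept term; a toy check (columns unrelated to my real data):

```python
import pandas as pd
import statsmodels.api as sm

# add_constant() appends a column of ones named "const"
# (at the end of the frame, because prepend=False).
X = pd.DataFrame({"Case_nominative": [1, 0], "Case_genitive": [0, 1]})
X_const = sm.add_constant(X, prepend=False)
print(X_const.columns.tolist())  # ['Case_nominative', 'Case_genitive', 'const']
```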
If I look at the data myself, it is easy to eyeball that while there is a difference between nominative and genitive, there is very little difference between nominative and accusative. Based on this I would expect that in the model only the variable for the genitive case has a significant effect.
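(By "looking at the data myself" I mean a quick per-case summary along these lines, using the toy `df` sketched above and pooling over lexemes:)

```python
# Overall proportion of marked forms per case, pooled over all lexemes.
per_case = df.groupby("Case")[["Nmarked", "Nunmarked"]].sum()
per_case["prop_marked"] = per_case["Nmarked"] / (per_case["Nmarked"] + per_case["Nunmarked"])
print(per_case)
```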
What I find instead is that the variables for all cases have a significant effect, with a p-value of 0.000:
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Case_nominative | -0.0962 | 0.012 | -8.330 | 0.000 | -0.119 | -0.073 |
| Case_genitive | -0.4009 | 0.015 | -25.916 | 0.000 | -0.432 | -0.370 |
| Case_accusative | -0.1058 | 0.021 | -5.138 | 0.000 | -0.147 | -0.065 |
(My program also reports a Cox-Snell (CS) pseudo $R^2$ of $1$, and a Pearson $\chi^2$ of $488$. My full dataset has $18$ lexemes with $5$ cases each, i.e. $90$ observations/rows in the table, with a total of $335{,}464$ word forms from the corpus.)
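(For completeness, these diagnostics can also be read off the fitted results object; the attribute and method names below are from the statsmodels GLM results API, and `pseudo_rsquared` may only be available in newer versions:)

```python
print(res.pearson_chi2)   # Pearson chi-squared of the fitted model
print(res.df_resid)       # residual degrees of freedom
# Newer statsmodels versions expose the Cox-Snell pseudo R^2 directly:
print(res.pseudo_rsquared(kind="cs"))
```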
I do see that the coefficient for the genitive is much larger in magnitude. However, I find it dodgy that the p-values are so extremely low, and I am not sure how I should report these results when discussing the hypothesis that there is a significant difference in marking between cases.
If anyone can point out where I'm going wrong, that would be much appreciated.
`results.t_test("Case_genitive = Case_accusative")`. `t_test` is for a single hypothesis, `wald_test` is for a joint hypothesis, e.g. `results.wald_test("Case_genitive - Case_accusative, Case_genitive - Case_nominative")`. (Aside: even though it's called `t_test`, it uses the normal asymptotic distribution by default in GLM.) – Josef Feb 24 '24 at 03:08
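For reference, on the fitted `res` above, those suggestions would look roughly like this (the constraint strings refer to the dummy column names created by get_dummies; this is a sketch of the suggested usage, not output I have run):

```python
# Single hypothesis: genitive and accusative coefficients are equal.
print(res.t_test("Case_genitive = Case_accusative"))

# Joint hypothesis: all three case coefficients are equal.
print(res.wald_test("Case_genitive - Case_accusative, Case_genitive - Case_nominative"))
```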