
In a language like Ancient Greek, verbal forms are marked for voice (active/middle/passive). Deponent verbs are verbs that exist only in the middle (or passive) voice but appear to have an active meaning, e.g. ἔρχομαι 'come'. Suppose I have the impression that relatively many deponent verbs share a certain feature; for example, I may think that there are uncannily many deponent verb roots containing the letter ρ (this example is entirely made up). How would I go about testing whether this impression is statistically significant?

I have a database of parsed verbal forms, so I can count how many verbs contain ρ, how many deponent verbs there are, etc. But what kind of statistical test should I use to calculate the probability that the observed number of deponent verbs with ρ is due to chance?

Based on suggestions in the comments, here is what I would propose:

  1. Search for all verbal roots in a lexicon. Count the number of roots containing ρ. This gives me the a priori probability $p$ that any given root contains ρ.
  2. Count the number of deponent verbs in the lexicon. This is the size $n$ of my sample.
  3. Suppose there are $r$ deponent verbs with a ρ. The probability of observing exactly this value by chance is dbinom(r, n, p) in R. However, I need to know the probability that there are $r$ or more deponent verbs with a ρ; for this I can use 1 - pbinom(r - 1, n, p). (See the sketch after this list.)
  4. I am not sure how to decide whether the measured value is "small enough" to consider it unlikely to be due to chance. Is this where you set a significance level like 0.05, based on how willing you are to accept a Type I error?

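Here is what steps 1–3 would look like in R. All counts below are made up purely for illustration; the real values would come from the lexicon:

```r
# Placeholder counts; substitute real counts from the lexicon.
n_roots <- 5000               # all verbal roots in the lexicon
n_rho   <- 1200               # roots containing rho
p       <- n_rho / n_roots    # step 1: a priori probability of rho

n <- 400                      # step 2: number of deponent roots
r <- 130                      # step 3: deponent roots containing rho

# One-sided p-value: probability of r or more rho-roots among
# n deponent roots, if each root contains rho with probability p.
1 - pbinom(r - 1, n, p)

# Equivalently, base R's exact binomial test, which also reports
# a confidence interval for the observed proportion:
binom.test(r, n, p, alternative = "greater")
```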
What I am particularly worried about (apart from whether the above is correct):

  • The sample is not independent of the population that I use to determine the a priori probability $p$. Is this a problem? (One alternative framing is sketched after this list.)
  • Given that there are 24 letters in the Greek alphabet, even if the distribution over deponent verbs is due to chance, there will probably be some letters for which there are relatively many or few deponent verbs. Should I not take the number of letters into account as well? (I'm thinking of spurious correlations: if you throw in enough variables, a strong correlation between some of them is expected even when they are not actually related.) Should I take this number 24 into account when deciding how small a probability measured in step 3 is meaningful? (See the multiple-comparison sketch after this list.)
  • Does it matter for this whether I have a hypothesis as to why ρ would be frequent in deponent verbs? (For example, I may think that the ρ goes back to an old morpheme for expressing middle voice; again, this is entirely made up.) That could be grounds for excluding other letters from the analysis, but on the other hand my gut feeling is that the statistics should be independent of whether you have a hypothesis or not.
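Regarding the first point, one framing that avoids estimating $p$ from the same data (suggested in the comments below) would be to treat every root in the lexicon as one observation of two binary variables, deponent or not and contains ρ or not, and test for association in a 2×2 table. A sketch with made-up counts consistent with the ones above:

```r
# Made-up 2x2 table of lexicon roots; replace with real counts.
# Rows: contains rho or not; columns: deponent or not.
tab <- matrix(c(130, 1070,    # rho:    130 deponent, 1070 other
                270, 3530),   # no rho: 270 deponent, 3530 other
              nrow = 2, byrow = TRUE,
              dimnames = list(rho = c("yes", "no"),
                              deponent = c("yes", "no")))

# Fisher's exact test of association between the two variables:
fisher.test(tab, alternative = "greater")
```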
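Regarding the second point, if I scan all 24 letters, the per-letter p-values presumably need to be adjusted for multiple comparisons. A sketch using base R's p.adjust, with placeholder p-values standing in for the 24 one-sided tests:

```r
# Placeholder vector standing in for the 24 one-sided p-values,
# one per letter of the Greek alphabet.
p_values <- c(0.002, runif(23))

p.adjust(p_values, method = "bonferroni")  # simple but conservative
p.adjust(p_values, method = "holm")        # never less powerful than Bonferroni
```

Bonferroni amounts to comparing each raw p-value against 0.05 / 24. This may also bear on the third point: if the ρ hypothesis was formulated before looking at the data, a single pre-registered test arguably needs no such correction, whereas a letter found by scanning all 24 would.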
Keelan
  • assuming that whether a verb stem contains a rho is randomly distributed (with some probability), the number of verb stems containing a rho in a given sample (either a corpus or lexicon) is binomially distributed. Tests of statistical significance on binomially distributed data are a classic class of problems for students, so you should be able to find examples online – Tristan Sep 06 '23 at 15:36
  • linguistic concerns people on this site may not be aware of: you should consider carefully how choosing between a corpus-based or lexicon-based test will affect your conclusions; in this case there may be advantages to both. You should make sure to exclude verb endings from your tests (whilst they don't contain ρ, they may contain other letters you choose to investigate; their distribution is known and predictable, and they will skew your data). Depending on your precise question you may in fact want to look only at roots – Tristan Sep 06 '23 at 15:38
  • it's been a long time since I've done a significance test, so I'll leave writing this up to someone from this site, but the above is a quick summary of what I had written before this question was migrated – Tristan Sep 06 '23 at 15:39
  • Thanks @Tristan! I will look at roots. I was planning to use a lexicon, but to include only roots with more than some minimum number of occurrences, to make it more likely that I select only true deponent verbs (rather than verbs whose active voice merely happens to be unattested). Happy to learn about the benefits of either approach, though. I don’t understand how “the number of verbs … in a sample … is binomially distributed”; it’s a single number, so how can it follow a distribution? It’s fine if you don’t want to write an answer, of course, but I leave this here so that everyone can see where I’m struggling. – Keelan Sep 06 '23 at 16:42
  • @Keelan If you throw two 6-sided dice once, you get a single number. But there is a probability distribution describing the outcomes, a probability associated with each possible outcome: 2, 3, 4, ..., 12. I would recommend starting by reading an introductory text on statistics. We have a number of questions bearing the [tag:references] tag, and this one is specifically about linguistics: https://stats.stackexchange.com/questions/130913/a-good-intro-to-computational-linguistics – Sycorax Sep 06 '23 at 16:48
  • This link may help you with your follow-up question: https://stats.stackexchange.com/questions/558327/do-all-observations-arise-from-probability-distributions – Arya McCarthy Sep 06 '23 at 16:50
  • @Sycorax I suppose, but if I wanted to find out whether a pair of dice is rigged, I'd throw it multiple times, rather than conclude it's rigged when it comes up 2 once. I'm sure this is not what you're suggesting, but as I'm working with a fixed corpus/lexicon I cannot just produce more data (throw multiple times). Or does every root correspond to one measurement (measuring a binary variable: deponent or not), and should I see whether there is a significant correlation between being a deponent verb and having a rho? (In any case, I'll start reading the references in the question you linked.) – Keelan Sep 06 '23 at 19:25
  • @Keelan The idea that observed data are a sample from some distribution over possible corpora/lexicons is a fundamental aspect of statistical modeling; without it, you couldn't compute the likelihood of a model. You're rediscovering this deeper question. – Arya McCarthy Sep 06 '23 at 19:58
  • @AryaMcCarthy Thanks, I'm now getting this, more or less, from Baayen's Analyzing Linguistic Data, one of the references from the link above (unfortunately, the question you linked in your previous comment was a little over my head). What I think I'll do is come up with a plan based on my reading and then edit it into this question, and hopefully someone will be able to give feedback on it. – Keelan Sep 06 '23 at 20:06

0 Answers