In a language like Ancient Greek, verbal forms are marked for voice (active/middle/passive). Deponent verbs are verbs that exist only in the middle (or passive) voice but appear to have an active meaning, e.g. ἔρχομαι 'come'. Suppose I have the impression that relatively many deponent verbs share a certain feature; for example, I may think there are uncannily many deponent verb roots containing the letter ρ (this example is entirely made up). How would I go about testing whether this impression is statistically significant?
I have a database of parsed verbal forms, so I can count how many verbs contain ρ, how many deponent verbs there are, and so on. But what kind of statistical test should I use to determine how likely it is that the observed number of deponent verbs containing ρ is due to chance?
Based on suggestions in the comments, here is what I would propose:
1. Search for all verbal roots in a lexicon and count the number of roots containing ρ. This gives me the a priori probability $p$ that any given root contains ρ.
2. Count the number of deponent verbs in the lexicon. This is the size $n$ of my sample.
3. Suppose there are $r$ deponent verbs containing ρ. The probability of observing exactly this value under the chance model is `dbinom(r, n, p)` in R. However, I need the probability that there are $r$ or more deponent verbs containing ρ; for this I can use `1 - pbinom(r - 1, n, p)` (see the sketch after this list).
4. I am not sure how to decide whether the value obtained in step 3 is "small enough" to consider the result unlikely to be due to chance. Is this where you set a threshold like 0.05, based on how willing you are to accept a Type I error?
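To make the proposal concrete, here is a minimal R sketch of steps 1–4. All counts are invented placeholders, and `binom.test` is shown only as a built-in alternative to computing the tail probability by hand.

```r
## Minimal sketch of the proposed test; every count here is an invented placeholder.
n_roots     <- 5000                     # all verbal roots in the lexicon
n_roots_rho <- 1500                     # roots containing rho
p <- n_roots_rho / n_roots              # step 1: a priori probability of rho in a root

n <- 400                                # step 2: number of deponent verbs (sample size)
r <- 150                                # step 3: deponent verbs containing rho

dbinom(r, n, p)                         # probability of exactly r under the chance model
1 - pbinom(r - 1, n, p)                 # probability of r or more (one-sided p-value)

## Built-in equivalent of the one-sided tail computation
binom.test(r, n, p, alternative = "greater")
```

If I understand correctly, the p-value reported by `binom.test(..., alternative = "greater")` should coincide with `1 - pbinom(r - 1, n, p)`.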
What I am particularly worried about (apart from whether the above is correct):
- The sample is not independent of the population that I use to determine the a priori probability $p$. Is this a problem?
- Given that there are 24 letters in the Greek alphabet, even if the distribution of letters over deponent verbs is purely due to chance, there will probably be some letters that are over- or under-represented among deponent verbs. Should I therefore take the number of letters into account when deciding how small the value from step 3 has to be before it is meaningful? (I am thinking of spurious correlations: if you throw in enough variables, a strong correlation between some of them is expected even when they are not actually related.) A sketch of what such a correction might look like follows this list.
- Does it matter here whether I have a hypothesis for why ρ would be frequent in deponent verbs? (For example, I may think that ρ goes back to an old morpheme expressing middle voice; again, this is entirely made up.) Such a hypothesis could be grounds for excluding the other letters from the analysis, but on the other hand my gut feeling is that the statistics should not depend on whether I have a hypothesis or not.
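Regarding the 24-letter worry, here is a hedged illustration of what a per-letter check with a multiple-comparisons adjustment (Holm or Bonferroni via `p.adjust`) might look like, if that turns out to be an appropriate way to handle it. The data frame and all counts are hypothetical.

```r
## Hypothetical illustration of correcting for having inspected all 24 letters.
## Every number below is made up; only a few letters are shown.
n_deponents <- 400

per_letter <- data.frame(
  letter = c("α", "ρ", "σ", "τ"),       # in practice, one row per letter (24 rows)
  p      = c(0.45, 0.30, 0.55, 0.40),   # a priori probability of the letter in a root
  r      = c(190, 150, 230, 160)        # deponent verbs containing the letter
)

## One-sided binomial p-value per letter, as in step 3
p_raw <- with(per_letter, 1 - pbinom(r - 1, n_deponents, p))

## Adjust for 24 comparisons, even though only 4 rows are shown here
p.adjust(p_raw, method = "holm", n = 24)
```

Whether this kind of correction is the right framing here is exactly part of my question.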