I have a dataset structured something like this - obviously the numbers are exaggerated and I actually have hundreds of observations in each group.
Group 1:
| Observation | Number of specific items | Total items | Proportion of specific to total |
|---|---|---|---|
| Obs1 | 10 | 100 | .10 |
| Obs2 | 20 | 190 | .11 |
| Obs3 | 15 | 160 | .09 |
Group 2:
| Observation | Number of specific items | Total items | Proportion of specific to total |
|---|---|---|---|
| Obs4 | 150 | 500 | .30 |
| Obs5 | 13 | 50 | .26 |
| Obs6 | 75 | 250 | .30 |
In this case, each observation has a different total number of items and that number can vary a lot between observations.
I want to compare the Proportion of Specific Items (the last column) between the two groups. The hypothesis is that this proportion differs between the two groups. Essentially, I believe this is a comparison of the means of the proportion columns, but I'm concerned that a t-test is not the right test here because that column is a proportion, i.e. a ratio of specific items to total items.
A t-test doesn't seem correct since I'm comparing proportional values, but a two-proportion z-test doesn't make sense either because I have multiple observations per group and cannot collapse them into a simple 2x2 contingency table.
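For concreteness, here is a minimal Python sketch of the naive approach I have in mind (using the toy numbers above; the array names and the Welch t-test call are just my own illustration, not a claim that this is the right analysis):

```python
import numpy as np
from scipy import stats

# Toy version of the data: (specific items, total items) per observation.
group1 = np.array([(10, 100), (20, 190), (15, 160)], dtype=float)
group2 = np.array([(150, 500), (13, 50), (75, 250)], dtype=float)

# Per-observation proportions -- the quantity I want to compare.
p1 = group1[:, 0] / group1[:, 1]
p2 = group2[:, 0] / group2[:, 1]

# The "naive" comparison of means of proportions via a Welch t-test;
# note this weights every observation equally regardless of its total count.
t_stat, p_val = stats.ttest_ind(p1, p2, equal_var=False)
print(t_stat, p_val)
```

This is exactly the comparison I'm unsure about, since it ignores how much the denominators vary across observations.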
Some additional clarifications and details:
Each of the count variables is expected to be approximately normally distributed. However, the two are surely correlated, since a larger total number of items implies a larger number of specific items as well, at least for this particular problem.
Doing a little more research, I found the ratio distribution. This seems to fit quite nicely, and there is a published result for the distribution of the ratio of two correlated normal random variables (Hinkley 1969).
Further, a similar question has been asked here:
Test for significant difference in ratios of normally distributed random variables
Based on these results, I feel like my instinct that the t-test is not appropriate was correct. However, there does not appear to be a specific named test for this situation. Suggestions there included the Delta Method (I need to study this one more closely to feel comfortable with the math) and a permutation test to build an empirical null distribution.
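For reference, my current understanding of the Delta Method suggestion is the standard first-order approximation for the variance of a ratio of two correlated random variables (my own summary of the general result, not anything specific to my data):

$$\operatorname{Var}\!\left(\frac{X}{Y}\right) \approx \left(\frac{\mu_X}{\mu_Y}\right)^{2}\left(\frac{\sigma_X^{2}}{\mu_X^{2}} + \frac{\sigma_Y^{2}}{\mu_Y^{2}} - \frac{2\,\operatorname{Cov}(X, Y)}{\mu_X\,\mu_Y}\right)$$

And here is a rough sketch of how I picture the permutation test, shuffling group labels to build an empirical null distribution for the difference in mean proportions (illustrative values and names, not my actual dataset):

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-observation proportions (specific / total) for each group --
# placeholder values standing in for the real data.
group1 = np.array([0.10, 0.11, 0.09])
group2 = np.array([0.30, 0.26, 0.30])

observed = group1.mean() - group2.mean()

pooled = np.concatenate([group1, group2])
n1 = len(group1)
n_perm = 10_000
null_diffs = np.empty(n_perm)

for i in range(n_perm):
    perm = rng.permutation(pooled)                 # shuffle group labels
    null_diffs[i] = perm[:n1].mean() - perm[n1:].mean()

# Two-sided p-value: fraction of permuted differences at least as extreme
p_value = np.mean(np.abs(null_diffs) >= abs(observed))
print(p_value)
```

With 500+ observations per group, 10,000 label shuffles should give a reasonably stable empirical p-value.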
The permutation test made me think of non-parametric approaches more generally. Something like a Wilcoxon rank-sum test seems like it might be appropriate here, and given the number of observations (500+ in each group), I'm not too concerned about the loss of power with such a test.
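Concretely, the non-parametric comparison I have in mind is something like this (again just a sketch with placeholder arrays):

```python
import numpy as np
from scipy.stats import ranksums  # Wilcoxon rank-sum test

# Placeholder per-observation proportions; in practice these would be
# the hundreds of values from each group.
group1 = np.array([0.10, 0.11, 0.09])
group2 = np.array([0.30, 0.26, 0.30])

# Two-sided Wilcoxon rank-sum test on the per-observation proportions
stat, p_value = ranksums(group1, group2, alternative="two-sided")
print(stat, p_value)
```

Note that this still treats each observation's proportion as a single value, regardless of how many total items it is based on.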
I appreciate any feedback on using a non-parametric test in this case.
EDIT: Added some clarifying text from my comment. Thanks to @whuber for asking for the clarification which very much helped me to better frame my question.
EDIT: Added some additional clarification regarding the distributions of the variables, their correlation, etc.