
Trying to choose between these two tests for data I've harvested from the Android store. Basically, I want to see if there is any difference in the number of dangerous permissions requested by free vs. paid apps. I have equal sample sizes of 1900. When I plot the data, both distributions are highly skewed, almost like decay curves. I understand the Student's t-test has an assumption of normality, but I'm not sure what has to be normally distributed, so I'm not sure whether the t-test is the right choice or whether to use the non-parametric Mann-Whitney.

Glen_b
piggo

2 Answers


Skewness will give you trouble with the t-test, yes. You could perhaps do a Mann-Whitney, but since the data are counts, you probably need a test suited to count data.

I'd be inclined to suggest assuming something like a Poisson model and then conditioning on the sum (which gives a binomial test), but since you have a mix of applications, there may be additional skewness induced by that heterogeneity.
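To make the "condition on the sum" idea concrete, here is a minimal sketch in Python (not from the answer), assuming the per-group totals of dangerous permissions behave roughly like Poisson counts; the totals below are made up for illustration and would be replaced by the real sums.

```python
# Hypothetical sketch: if each group's total count is (roughly) Poisson and the
# groups are the same size, then under H0 the free-app total, conditional on the
# grand total, is Binomial(grand total, 0.5). The totals below are invented.
from scipy.stats import binomtest

n_free, n_paid = 1900, 1900          # apps sampled per group
total_free = 5200                    # hypothetical sum of dangerous permissions, free apps
total_paid = 4870                    # hypothetical sum of dangerous permissions, paid apps

p_null = n_free / (n_free + n_paid)  # 0.5 here because the group sizes are equal
result = binomtest(total_free, n=total_free + total_paid, p=p_null)
print(result.pvalue)
```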

How skewed are the distributions?

How were the applications selected?

You may ultimately be best off treating the applications as a random effect.

Glen_b
  • The distributions are strongly skewed around 0, then fall off sharply. The applications were selected by randomly going through the letters of the alphabet, selecting 50, then removing duplicates. No effort was made to take an equal number from each app category, which is a potential bias, but other methods of selection (e.g. via leaderboards) also have biases. We got significant results for both Student's t and Mann-Whitney. – piggo Mar 10 '13 at 09:54
  • Another possibility is to do a permutation or randomization test (a sketch follows these comments). – Glen_b Mar 10 '13 at 12:12
  • If there are a lot of 0's (which is what I think you mean by "strongly skewed around 0"), then you may need a model that accounts for that, such as a zero-inflated negative binomial model. – Peter Flom Mar 10 '13 at 12:59
  • Thanks! Also wondering if transforming the data to a normal distribution and then doing an independent Student's t-test is an option? – piggo Mar 10 '13 at 19:42
  • It can be, but beware that you're no longer comparing means of the original variable. It can still pick up some more general sense of increase (as long as the transformation is monotonic): the null is still 'the distributions are the same', but the alternative is some general shift toward larger or smaller values. For example, a log transformation would mean that on the original untransformed scale you were comparing something that's effectively scale (spread) rather than mean. Beware: if there's discreteness in your data (like a lot of zeros), you can't transform that away. – Glen_b Mar 10 '13 at 21:45
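As a rough illustration of the permutation test mentioned in the comments above, here is a minimal sketch, assuming the counts sit in two NumPy arrays; the Poisson draws are placeholders for the real permission counts.

```python
# Sketch of a two-sample permutation test on the difference of means.
# The Poisson draws are placeholder data; substitute the real counts.
import numpy as np

rng = np.random.default_rng(0)
free = rng.poisson(2.5, 1900)    # placeholder counts, free apps
paid = rng.poisson(2.3, 1900)    # placeholder counts, paid apps

observed = free.mean() - paid.mean()
pooled = np.concatenate([free, paid])
n_free = free.size

diffs = np.empty(10000)
for i in range(diffs.size):
    perm = rng.permutation(pooled)                         # relabel apps at random
    diffs[i] = perm[:n_free].mean() - perm[n_free:].mean()

p_value = (np.abs(diffs) >= abs(observed)).mean()          # two-sided p-value
print(observed, p_value)
```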
  1. Regardless of the skewness of the two samples, with 1900 observations per sample the CLT does apply, and then some. So you absolutely can do a two-sample t-test; well, really a two-sample Welch test (because, very likely, your two variances are quite different). In fact, one should always run a Welch test, regardless of the values of the variances: it works much better when the variances are different, works just as well when they are "close", and it is one less assumption to validate. So a t-test is 100% valid, no matter the skew/kurtosis; it is not the populations which need to be normal, but the sampling distribution of the statistic, and with 1900 observations the CLT will rule! (See the code sketch after this list.)
  2. Do you want to compare means? Then run a t-test (see above). But maybe you want to look at proportions (e.g. what proportion of apps require more than 3 dangerous permissions). Here you would need the populations to be normal, and they are not. But, with 1900 samples, you can use binomial tests (i.e. contingency tables). Yes, you will lose power, but with such large sample sizes, you can probably afford it.
  3. Do not transform your data. Whatever result you obtain in the transformed space, you will not be able to interpret it in the "original" space (you cannot reverse-transform the statistics, e.g. the mean of the logs is not the log of the mean, or the CIs, or the p-values). So, if there is a significant "transformed" difference, it tells you nothing about whether there is a "real" difference.
  4. With such large samples, you are a prime candidate for bootstrapping. With 1900 values per sample, you will not need to defend your use of the method: your samples are very representative of the populations, so bootstrap away: difference of means, medians, proportions, etc...
  5. You could also run a Mann-Whitney U test, but with such differently skewed samples you will not be able to claim a difference of medians, just stochastic dominance (or the absence thereof).
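A minimal sketch of points 1 and 2 above (not part of the original answer), assuming the counts are in NumPy arrays free and paid; the Poisson draws are placeholders, and the "more than 3 permissions" cutoff is just the example from point 2.

```python
# Sketch of points 1 (Welch test) and 2 (comparing proportions) with placeholder data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
free = rng.poisson(2.5, 1900)    # placeholder counts, free apps
paid = rng.poisson(2.3, 1900)    # placeholder counts, paid apps

# Point 1: Welch's two-sample t-test (equal_var=False).
t_stat, p_welch = stats.ttest_ind(free, paid, equal_var=False)

# Point 2: proportion of apps requiring more than 3 dangerous permissions,
# compared via a 2x2 contingency table.
table = np.array([
    [(free > 3).sum(), (free <= 3).sum()],
    [(paid > 3).sum(), (paid <= 3).sum()],
])
chi2, p_prop, dof, expected = stats.chi2_contingency(table)
print(p_welch, p_prop)
```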

Conclusion: I would run a two-sample Welch test for a difference of means. Then I would use bootstrapping for anything else that might be of interest. And I would finish with a Mann-Whitney U test, for good measure, to see if I can add stochastic dominance to the findings (though if the previous tests did not show much of a difference, I would not bother with M-W U).
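A sketch of the bootstrap and Mann-Whitney steps from the conclusion, with the same placeholder setup as above (replace the simulated arrays with the real counts).

```python
# Sketch: percentile bootstrap CI for the difference in means, plus Mann-Whitney U.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
free = rng.poisson(2.5, 1900)    # placeholder counts, free apps
paid = rng.poisson(2.3, 1900)    # placeholder counts, paid apps

# Bootstrap a 95% percentile CI for the difference of means.
boot = np.empty(10000)
for i in range(boot.size):
    boot[i] = (rng.choice(free, free.size, replace=True).mean()
               - rng.choice(paid, paid.size, replace=True).mean())
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])

# Mann-Whitney U test; a significant result speaks to stochastic dominance,
# not necessarily a difference of medians.
u_stat, p_mw = stats.mannwhitneyu(free, paid, alternative='two-sided')
print((ci_low, ci_high), p_mw)
```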

jginestet
  • "Highly skewed" suggests caution in applying the CLT in any fashion. See https://stats.stackexchange.com/questions/69898 for a real counterexample. The recommendation not to transform data seems severe, given that the objective is not to compare means but only to identify "differences." – whuber Jan 13 '24 at 22:36