
My question is similar to "Testing difference in kurtosis between two samples", where a comment suggested:

"Unless you are looking for an enormous difference in kurtosis, it's unlikely any physically realizable sample size will produce significant results."

Either I am doing something wrong, or I have an enormous difference.

I have data from an experiment (N = 20,000) in which participants completed a behavioral task. There are two approximately equal-sized groups. Group A is a control group, and group B was exposed to something that is hypothesized to make the really good performers better and the really bad performers worse, while having no effect on average performers. While the field is in agreement with the hypothesis, nobody has really defined what a really good or bad performer is. We often talk about it as someone who is in the top or bottom 10%. While we would love to have longitudinal data, we only have a single data point for each subject (i.e., after group B was exposed).

To me this sounds like I want to test for differences in kurtosis (i.e., are the tails in group B heavier than the tails in group A?). So far I have resampled the data (this is how I approach most stats problems). I chose 20,000 subjects at random, with replacement, and divided them into their groups. I then calculated the kurtosis of the two distributions (along with the mean, standard deviation, skewness, and the 5th, 25th, 50th, 75th, and 95th percentiles) and took the difference in kurtosis. I repeated this 10,000 times. The mean difference in kurtosis is 3.5 with a standard deviation of 0.8, and none of the 10,000 kurtosis differences is less than zero. This makes me think there may be a statistically reliable difference between the groups.
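For concreteness, a minimal sketch of that resampling procedure in Python (the array names `scores` and `group` are placeholders for the actual data, and `scipy.stats.kurtosis` returns excess kurtosis by default):

```python
import numpy as np
from scipy.stats import kurtosis

def bootstrap_kurtosis_diff(scores, group, n_boot=10_000, seed=0):
    """Bootstrap the difference in excess kurtosis, kurtosis(B) - kurtosis(A).

    Subjects are resampled with replacement and keep their original group
    labels, matching the resampling procedure described above.
    """
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores)
    group = np.asarray(group)
    n = len(scores)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample subjects with replacement
        s, g = scores[idx], group[idx]
        diffs[i] = kurtosis(s[g == "B"]) - kurtosis(s[g == "A"])
    return diffs

# Usage with the hypothetical arrays `scores` (floats) and `group` ("A"/"B" labels):
# diffs = bootstrap_kurtosis_diff(scores, group)
# print(diffs.mean(), diffs.std(), np.percentile(diffs, [2.5, 97.5]))
```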

What is a better way to test if there is a difference between the two groups?

QQ plot: Lower performance is better. The task is nominally bounded on both ends, but essentially no subjects perform at floor (-7), while 5% of subjects perform at ceiling (-30). Since the unexposed group (group A) is already at ceiling, I am not sure we will see the improvement due to group B's exposure.

[QQ plot: group A quantiles vs. group B quantiles]
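For anyone wanting to reproduce this kind of figure, here is a minimal two-sample QQ plot sketch; the synthetic arrays below are only stand-ins for the real group A and group B scores:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-ins for the real scores (lower is better, roughly bounded).
rng = np.random.default_rng(0)
scores_a = np.clip(rng.normal(-18, 4, 10_000), -30, -7)
scores_b = np.clip(rng.normal(-18, 4, 10_000) + rng.standard_t(3, 10_000), -30, -7)

# Match quantiles of the two samples and plot them against each other.
probs = np.linspace(0.5, 99.5, 199)
qa = np.percentile(scores_a, probs)
qb = np.percentile(scores_b, probs)

plt.plot(qa, qb, "o", ms=3)
lims = [min(qa.min(), qb.min()), max(qa.max(), qb.max())]
plt.plot(lims, lims, "k--", lw=1)   # y = x reference line
plt.xlabel("Group A (control) quantiles")
plt.ylabel("Group B (exposed) quantiles")
plt.show()
```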

StrongBad
  • (1) Why use kurtosis when a more stable statistic like the variance might do? (2) Have you looked at a QQ plot? (3) With 20K subjects and (evidently) a low-kurtosis population (maybe your measurements are bounded?), you can indeed hope to test kurtosis. But (4) as is cogently argued in many other threads here, kurtosis is not a great surrogate for tail weight. In your case it might be (again, boundedness can help), but generally one would not utilize kurtosis as a first choice. – whuber Dec 13 '23 at 22:36
  • Also: your title suggests that bootstrapping does not support "formal testing." That would be a misconception. – whuber Dec 13 '23 at 22:48
  • @whuber (1) the field is adamant that it is a tail effect and not just more variability. Kurtosis very well may not be the right stat, but I am in way over my head. (2) I looked at them, but don't know what to do with them. (3) The measurement is definitely bounded (I will add that). (4) I am open to ideas. – StrongBad Dec 13 '23 at 23:03
  • I don't see the distinction between "tail effect" and "more variability." Are you saying that while the tails get heavier the variance (and other lower-order measures of variability, such as absolute third central moments) stay constant? We have threads on interpreting QQ plots, so they're worth a look. They show you in detail how all parts of the distributions vary. – whuber Dec 13 '23 at 23:08
  • With so many observations, relative distributions with their plots might be useful! https://stats.stackexchange.com/questions/28431/what-are-good-data-visualization-techniques-to-compare-distributions/274058#274058 – kjetil b halvorsen Dec 14 '23 at 01:11
  • Tail weight is a vague property that can be measured in various ways that don't stop with moment-based kurtosis (not to get into the fact that some persist in regarding kurtosis as peakedness, and so on). Have you thought of L-moments as a framework? https://en.wikipedia.org/wiki/L-moment is a fair start. [A sketch of sample L-moments follows after these comments.] – Nick Cox Dec 14 '23 at 07:29
  • Can you say more about your data? What are the 2 % corrects? Could you show qq-plots? Note that bootstrapping is a perfectly fine way to conduct a hypothesis test and get a valid p-value. – gung - Reinstate Monica Dec 14 '23 at 12:46
  • @gung-ReinstateMonica I added the QQ plot. Not sure what 2% corrects are. My description of bootstrapping was a poor choice. If that is the way to test for differences in kurtosis, then I will do it. – StrongBad Dec 14 '23 at 22:58
  • @StrongBad, not "2%", the 2 (two) % (percent) corrects. Ie, group A gets 50% correct & group B gets 75% correct. – gung - Reinstate Monica Dec 15 '23 at 12:45
  • From the qq-plot, it looks like the distributions are largely similar in shape, but that group A has a ceiling while group B blows through it. It is certainly (theoretically) possible to test kurtosis, but it's likely to be difficult. I wonder if it would be easier to test differences in skew. They look like they have different skews. In fancy terms, the mean is the 1st moment, variance is the 2nd, skew the 3rd, kurtosis the 4th, etc. It's easiest to test the mean, & it gets more difficult as you go up (e.g., the variance depends on the mean, which adds greater uncertainty to the analysis). – gung - Reinstate Monica Dec 15 '23 at 12:50
  • It also seems like this might be a mixture distribution where there is some latent grouping variable that interacts with the treatment. I don't know if there are any clustering approaches that might be profitable. – gung - Reinstate Monica Dec 15 '23 at 12:51
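Following up on the comments about formal testing via resampling and about L-moments, here is a minimal sketch of a permutation (label-shuffling) test of the group difference that can use either moment kurtosis or the L-kurtosis ratio as the statistic. The array names `scores_a` and `scores_b` are placeholders, and the L-moment code uses the standard unbiased sample estimators:

```python
import numpy as np
from scipy.stats import kurtosis

def l_kurtosis(x):
    """Sample L-kurtosis ratio tau_4 = l_4 / l_2 (unbiased sample L-moments)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    i = np.arange(1, n + 1)
    b0 = x.mean()
    b1 = np.sum((i - 1) / (n - 1) * x) / n
    b2 = np.sum((i - 1) * (i - 2) / ((n - 1) * (n - 2)) * x) / n
    b3 = np.sum((i - 1) * (i - 2) * (i - 3) / ((n - 1) * (n - 2) * (n - 3)) * x) / n
    l2 = 2 * b1 - b0
    l4 = 20 * b3 - 30 * b2 + 12 * b1 - b0
    return l4 / l2

def permutation_test(a, b, stat=kurtosis, n_perm=10_000, seed=0):
    """Two-sided permutation test of stat(b) - stat(a) under random relabelling."""
    rng = np.random.default_rng(seed)
    observed = stat(b) - stat(a)
    pooled = np.concatenate([a, b])
    n_a = len(a)
    null = np.empty(n_perm)
    for k in range(n_perm):
        perm = rng.permutation(pooled)   # shuffle the group labels
        null[k] = stat(perm[n_a:]) - stat(perm[:n_a])
    p = (1 + np.sum(np.abs(null) >= abs(observed))) / (n_perm + 1)
    return observed, p

# Usage with the hypothetical arrays `scores_a` and `scores_b`:
# obs, p = permutation_test(scores_a, scores_b, stat=kurtosis)    # moment (excess) kurtosis
# obs, p = permutation_test(scores_a, scores_b, stat=l_kurtosis)  # L-kurtosis
```

Whether the permutation null is appropriate here depends on whether the two groups are exchangeable under the null hypothesis; a bootstrap confidence interval for the difference, as described in the question, is the other natural route.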

0 Answers