2

I have a dataset with two populations (unpaired). First one with 36 observations, second one 74. The first passes the Shapiro normality test with p-value = 0.1521, the second fails it with p-value = 0.01551.

I would like to run t.test(first, second). How should I proceed? If I apply a Box-Cox transformation to the second population, it no longer makes sense to compare it with the first population.

Edit: Here are the summary statistics requested by @BruceET

> lapply(g, function(x) c(mean=mean(x), var=var(x), sd=sd(x)))
$mark
     mean       var        sd 
21.986111  6.364087  2.522714

> lapply(p, function(x) c(mean=mean(x), var=var(x), sd=sd(x)))
$mark
     mean       var        sd 
25.378378  7.608293  2.758313

[Figures: histograms of the two samples]

Pier
  • 23
  • 5
    These tests of Normality are essentially irrelevant. It's unlikely you have a problem--but the best way to assess that begins by plotting the data distribution and not by reporting the p-value of a Normality test. – whuber Jun 04 '22 at 13:13
  • 1
    Can you show means, variances, boxplots of the two samples? – BruceET Jun 04 '22 at 15:43
  • 1
    @BruceET I have added what you asked for – Pier Jun 05 '22 at 11:48
  • 2
    See also: This CrossValidated link; the answer by Caracol for implementing a permutation test in R, and note the first comment by whuber under Caracol's answer regarding the applicability of a traditional t-test for the example. – Sal Mangiafico Jun 05 '22 at 14:10

2 Answers

1

I attempted to digitize your data (approximately), based on your histograms. Of course, my two samples do not exactly match yours, but they seem sufficiently similar to your data to use for illustrative purposes:

x1
 [1] 18 18 18 18 18 19 19 19 19 19 19 20 20 20 20 20 20 20 20 20 20 21 21 21 21
[26] 21 21 21 21 21 21 21 22 22 22 22 22 22 23 23 23 23 23 23 24 24 25 25 26 27
length(x1); mean(x1); sd(x1)
[1] 50
[1] 21.12
[1] 2.115444

x2
 [1] 19 19 19 19 21 21 21 21 21 21 21 21 21 21 22 22 22 22 22 22 23 23 23 23 23
[26] 23 24 24 24 24 24 24 24 24 25 25 25 25 25 25 25 25 25 25 25 26 26 26 26 26
[51] 26 26 26 26 26 26 26 26 27 27 27 27 27 27 28 28 28 28 28 28 29 29 29 29
length(x2); mean(x2); sd(x2)
[1] 74
[1] 24.41892
[1] 2.668983

Normal quantile-quantile plots of both samples (x1 at left) are roughly linear, suggesting that neither sample is far from normally distributed.

[Figure: normal Q-Q plots of x1 (left) and x2 (right)]

Also, boxplots show no outliers or extreme skewness, so a pooled t test seems a reasonable choice. Notches in the sides of the boxes are nonparametric CIs for the medians, calibrated so that lack of overlap suggests a difference in medians. We should not be surprised if the means are also significantly different.

[Figure: notched boxplots of x1 and x2]
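
The diagnostic plots described above can be reproduced along these lines. This is only a sketch: x1 and x2 are rebuilt from the digitized values printed earlier, and the layout and titles are illustrative choices, not necessarily those used for the original figures.

```r
# Rebuild the digitized samples from the counts of each value printed above
x1 <- rep(18:27, times = c(5, 6, 10, 11, 6, 6, 2, 2, 1, 1))
x2 <- rep(c(19, 21:29), times = c(4, 10, 6, 6, 8, 11, 13, 6, 6, 4))

# Normal Q-Q plots side by side (x1 at left)
par(mfrow = c(1, 2))
qqnorm(x1, main = "x1"); qqline(x1)
qqnorm(x2, main = "x2"); qqline(x2)
par(mfrow = c(1, 1))

# Notched boxplots; the notches are approximate CIs for the medians
boxplot(x1, x2, notch = TRUE, horizontal = TRUE, names = c("x1", "x2"))
```

With integer-valued data like these, the Q-Q plots show a staircase pattern; what matters is whether the steps follow the reference line.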

A pooled 2-sample t test finds a highly significant difference in the two sample means--with t statistic $T = -7.32$ and P-value very near $0.$

t.test(x1, x2, var.eq=T)
    Two Sample t-test

data:  x1 and x2
t = -7.3204, df = 122, p-value = 2.896e-11
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -4.191024 -2.406814
sample estimates:
mean of x mean of y 
 21.12000  24.41892 
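
To see where the t statistic comes from, it can be computed by hand as $T = (\bar x_1 - \bar x_2)/\bigl(s_p\sqrt{1/n_1 + 1/n_2}\bigr)$, where $s_p^2$ is the pooled variance. A minimal sketch, assuming the digitized samples rebuilt from the values printed above (the names `sp2`, `t.manual`, `p.manual`, and `t.builtin` are introduced here only for illustration):

```r
# Rebuild the digitized samples from the counts of each value printed above
x1 <- rep(18:27, times = c(5, 6, 10, 11, 6, 6, 2, 2, 1, 1))
x2 <- rep(c(19, 21:29), times = c(4, 10, 6, 6, 8, 11, 13, 6, 6, 4))

n1 <- length(x1); n2 <- length(x2)

# Pooled variance: weighted average of the two sample variances
sp2 <- ((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2)

# Pooled t statistic and two-sided P-value on n1 + n2 - 2 = 122 df
t.manual <- (mean(x1) - mean(x2)) / sqrt(sp2 * (1/n1 + 1/n2))
p.manual <- 2 * pt(-abs(t.manual), df = n1 + n2 - 2)

# Compare with the built-in test
t.builtin <- unname(t.test(x1, x2, var.equal = TRUE)$statistic)
all.equal(t.manual, t.builtin)
```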

We get the same results in 'stacked format':

x = c(x1, x2);  g = rep(1:2, c(50,74))
t.test(x ~ g, var.eq=T)$stat
        t 
-7.320372 

As in the discussion in Comments, it seems to me that a t test is appropriate. However, if some qualms about using a t test remain in spite of the usual favorable indications, we could use the pooled t statistic as the metric in a permutation test.

Whether the t statistic has exactly Student's t distribution with 122 degrees of freedom or not, this statistic seems a reasonable way to express the difference between sample means, compared with the variability of the samples.

Below we use R to approximate the permutation distribution of the pooled t statistic. We scramble the $n_1 + n_2 = 124$ observations between groups 1 and 2 repeatedly and find the t statistic for each permutation. The resulting values of $T$ form the permutation distribution of the t statistic. Here, the P-value of the approximate permutation test is essentially $0.$ [In fact, the approximate permutation distribution of the t statistic is approximately $\mathsf{T}(\nu=122),$ shown in the figure below.]

set.seed(2022)
t.perm = replicate(10^4, t.test(x~sample(g), var.eq=T)$stat)
mean(abs(t.perm) >= 7.32)
[1] 0        # P-value of approx. permutation test

[Figure: approximate permutation distribution of the t statistic, compared with $\mathsf{T}(\nu=122)$]
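
A simulated P-value of exactly 0 only means that none of the sampled permutations was as extreme as the observed statistic. A common convention is to report $(b+1)/(B+1)$ instead, where $b$ is the number of permuted statistics at least as extreme as the observed one and $B$ is the number of permutations. A sketch of that variant, assuming the digitized samples rebuilt from the values printed above; the names `B`, `b`, `t.obs`, and `p.perm` are illustrative, and `B` is kept small here only to run quickly:

```r
# Rebuild the digitized samples from the counts of each value printed above
x1 <- rep(18:27, times = c(5, 6, 10, 11, 6, 6, 2, 2, 1, 1))
x2 <- rep(c(19, 21:29), times = c(4, 10, 6, 6, 8, 11, 13, 6, 6, 4))

x <- c(x1, x2)
g <- rep(1:2, c(length(x1), length(x2)))

# Observed pooled t statistic
t.obs <- unname(t.test(x[g == 1], x[g == 2], var.equal = TRUE)$statistic)

set.seed(2022)
B <- 2000
t.perm <- replicate(B, {
  gp <- sample(g)  # scramble the group labels
  unname(t.test(x[gp == 1], x[gp == 2], var.equal = TRUE)$statistic)
})

# Number of permuted statistics at least as extreme as the observed one
b <- sum(abs(t.perm) >= abs(t.obs))

# Permutation P-value with the +1 correction, so it is never exactly 0
p.perm <- (b + 1) / (B + 1)
p.perm
```

With this convention the smallest reportable P-value is $1/(B+1)$, so a larger `B` is needed to show a smaller P-value.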

Note: The 2-sample Wilcoxon rank sum test, as implemented in R, also gives a very small P-value, indicating a difference in location between samples x1 and x2. [There are many ties, but the sample sizes are large enough to get a reliable P-value.]

wilcox.test(x1, x2)
    Wilcoxon rank sum test 
    with continuity correction

data:  x1 and x2
W = 642.5, p-value = 6.301e-10
alternative hypothesis: true location shift is not equal to 0
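
If an estimate of the size of the shift is wanted along with the P-value, `wilcox.test` can report one through its `conf.int` argument. A sketch, again assuming the digitized samples rebuilt from the values printed above; with this many ties the estimate and interval should be read as approximate:

```r
# Rebuild the digitized samples from the counts of each value printed above
x1 <- rep(18:27, times = c(5, 6, 10, 11, 6, 6, 2, 2, 1, 1))
x2 <- rep(c(19, 21:29), times = c(4, 10, 6, 6, 8, 11, 13, 6, 6, 4))

# Rank sum test with an estimate of the location shift (x1 minus x2)
# and a 95% confidence interval for it
w <- wilcox.test(x1, x2, conf.int = TRUE)
w$p.value
w$estimate
w$conf.int
```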

BruceET
  • 56,185
0

If normality is heavily violated, you can consider non-parametric methods, which do not assume a particular population distribution. Most common parametric tests have a non-parametric counterpart. For the two-sample t-test with independent samples, that counterpart is the Mann-Whitney test (also called the Wilcoxon rank sum test).

tomathas
  • 5
  • 2