
I have two sets of data, SetA and SetB. They are metrics of the same group measured at two different times, i.e. paired: SetA is the score at 10 am and SetB is the score at 5 pm for the same 30 people.

I want to check whether the mean is significantly different between the two. The scores violate normality, so I have considered three tests and would love your feedback.

The tests are giving me wrong answers. My goal is to see whether the two sets are significantly different.

First Test - Wilcoxon signed-rank test


import scipy.stats as stats

# Data

setA = [0.9995, 1.0000, 1.0000, 1.0000, 1.0000, 0.0000, 0.9993, 0.9381, 0.6929, 0.7971, 0.8464, 0.0220, 0.9979, 0.8584, 0.7538, 0.8027, 0.8768, 0.0231, 0.9990, 0.8611, 0.6294, 0.7273, 0.8146, 0.0294, 0.9992, 0.8466, 0.7284, 0.7831, 0.8641, 0.0252]

setB = [0.9996, 0.9870, 0.7755, 0.8686, 0.8877, 0.0146, 0.9993, 0.9688, 0.6327, 0.7654, 0.8163, 0.0240, 0.9992, 0.8571, 0.6735, 0.7543, 0.8366, 0.0272, 0.9989, 0.7375, 0.6020, 0.6629, 0.8008, 0.0380, 0.9993, 0.8372, 0.7347, 0.7826, 0.8672, 0.0253]

# Perform Wilcoxon signed-rank test

statistic, p_value = stats.wilcoxon(setA, setB)

# Print results

print("Wilcoxon signed-rank test results:") print(f"Test statistic: {statistic}") print(f"P-value: {p_value}")

# Check significance level (e.g., alpha = 0.05)

alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference between the paired samples.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference between the paired samples.")

# Calculate the average mean for SetA

average_mean_setA = sum(setA) / len(setA)

# Calculate the average mean for SetB

average_mean_setB = sum(setB) / len(setB)

print(f"Average Mean for SetA: {average_mean_setA}") print(f"Average Mean for SetB: {average_mean_setB}")''''

Wilcoxon signed-rank test results:
Test statistic: 99.0
P-value: 0.01038770331258549
Reject the null hypothesis: There is a significant difference between the paired samples.
Average Mean for SetA: 0.7305133333333331
Average Mean for SetB: 0.6991266666666668
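For reference (a sketch only, not something I have validated against my results): a paired t-test on the same data, plus a Shapiro-Wilk check on the paired differences, since the paired t-test's normality assumption applies to the differences rather than to each set separately:

import numpy as np
import scipy.stats as stats

# Paired differences (setA and setB as defined above)
diffs = np.array(setB) - np.array(setA)

# Shapiro-Wilk test on the differences (the quantity the paired t-test assumes is normal)
shapiro_stat, shapiro_p = stats.shapiro(diffs)
print(f"Shapiro-Wilk on paired differences: W = {shapiro_stat:.4f}, p = {shapiro_p:.4f}")

# Paired t-test for comparison with the Wilcoxon result
t_stat, t_p = stats.ttest_rel(setA, setB)
print(f"Paired t-test: t = {t_stat:.4f}, p = {t_p:.4f}")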

Second Test - Bootstrap sampling

import numpy as np

# Calculate the observed mean difference
observed_mean_difference = np.mean(np.array(setB) - np.array(setA))

# Number of bootstrap samples

num_samples = 10000

# Initialize an array to store bootstrapped mean differences

bootstrap_mean_differences = np.zeros(num_samples)

# Perform bootstrap sampling

for i in range(num_samples):
    # Resample with replacement from the combined dataset
    combined_data = np.concatenate((setA, setB))
    resampled_data = np.random.choice(combined_data, size=len(combined_data), replace=True)

    # Calculate mean difference for this bootstrap sample
    bootstrap_mean_difference = np.mean(resampled_data[:len(setA)]) - np.mean(resampled_data[len(setA):])
    bootstrap_mean_differences[i] = bootstrap_mean_difference

# Calculate the p-value

p_value = np.sum(bootstrap_mean_differences >= observed_mean_difference) / num_samples

print("Bootstrap hypothesis test results:") print(f"Observed Mean Difference: {observed_mean_difference}") print(f"P-value: {p_value}") Bootstrap hypothesis test results: Observed Mean Difference: -0.03138666666666667 P-value: 0.6493

Even if I replace SetB with all 0s, the bootstrap still fails to reject the null (no significant difference).
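One variant I have been considering (sketch only, not part of the output above): a paired bootstrap that resamples the within-subject differences instead of pooling the two sets, and builds a confidence interval for the mean difference:

import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility
diffs = np.array(setB) - np.array(setA)  # setA and setB as defined above
observed_mean_difference = diffs.mean()

num_samples = 10000
bootstrap_means = np.empty(num_samples)
for i in range(num_samples):
    # Resample the paired differences with replacement, keeping each pair intact
    resampled_diffs = rng.choice(diffs, size=len(diffs), replace=True)
    bootstrap_means[i] = resampled_diffs.mean()

# 95% percentile confidence interval for the mean paired difference;
# if it excludes 0, that points to a significant difference at alpha = 0.05
ci_low, ci_high = np.percentile(bootstrap_means, [2.5, 97.5])
print(f"Observed Mean Difference: {observed_mean_difference}")
print(f"95% bootstrap CI: [{ci_low:.5f}, {ci_high:.5f}]")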

Third Test - Permutation Test

# Calculate the observed mean difference
observed_mean_difference = np.mean(setB) - np.mean(setA)

# Number of permutation samples

num_permutations = 10000

# Initialize an array to store permutation mean differences

permutation_mean_differences = np.zeros(num_permutations)

# Perform permutation sampling

for i in range(num_permutations):
    # Combine the data
    combined_data = setA + setB

    # Permute the combined data
    permuted_data = np.random.permutation(combined_data)

    # Calculate mean difference for this permutation sample
    perm_setA = permuted_data[:len(setA)]
    perm_setB = permuted_data[len(setA):]
    permutation_mean_difference = np.mean(perm_setB) - np.mean(perm_setA)
    permutation_mean_differences[i] = permutation_mean_difference

# Calculate the p-value

p_value = np.sum(permutation_mean_differences >= observed_mean_difference) / num_permutations

print("Permutation hypothesis test results:") print(f"Observed Mean Difference: {observed_mean_difference}") print(f"P-value: {p_value}") Permutation hypothesis test results: Observed Mean Difference: -0.031386666666666674 P-value: 0.6376

  • have you looked at repeated measures? – Estimate the estimators Aug 28 '23 at 04:11
  • (1) As @Estimatetheestimators writes, you should really be looking at repeated measures models, or paired t-tests. (2) The Wilcoxon and the t-test test very different things! (3) In your second model, you resample from the full dataset, instead of bootstrapping the difference in means (or bootstrapping the group means separately). That will not give you what you are presumably looking for, i.e., a confidence interval or a p-value for the difference in means. – Stephan Kolassa Aug 28 '23 at 06:24
  • "The tests are giving me wrong answers." How can you know that? – Christian Hennig Aug 28 '23 at 13:20
  • Note that the paired t-test assumes that the differences are normally distributed. How do you know that normality is violated? Note also that generally normality is always violated, and in many situations a t-test will work well anyway. It depends on how exactly normality is violated, i.e., whether anything goes on that will mislead the inference. Many deviations from normality are rather harmless. – Christian Hennig Aug 28 '23 at 13:23
  • @ChristianHennig - I ran a Q-Q plot and a Shapiro test - the Q-Q plot shows a very poor fit. When I populate the second set with all 0s, the bootstrap was still saying it was not significant while the avg mean diff was high. – b t Aug 29 '23 at 14:28
  • @StephanKolassa - for bootstrap, I am establishing the observed mean diff (x) of the two sets, then iterating n times and storing more observed mean diffs, then (total mean diffs above and greater) (y), to get p-value y/n, which is a probability metric I am testing against ... does this make sense? – b t Aug 29 '23 at 14:30
  • I'm not quite sure I understand your explanation, sorry... to bootstrap the difference in means, you would bootstrap the observations in both groups separately, then calculate the means of the bootstrapped data within each group, then take the difference, store this difference, and iterate the procedure many times. Finally you would compare the observed difference in means to the distribution of the bootstrapped differences. – Stephan Kolassa Aug 29 '23 at 14:46
  • @StephanKolassa - agreed, that is what I am doing with the code. Not sure if there is a flaw; combined_data stores all bootstrap samples, then this calculates the mean difference: bootstrap_mean_differences[i] = bootstrap_mean_difference. After that it is a proportion. – b t Aug 29 '23 at 17:12
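For illustration, a sketch of the bootstrap procedure Stephan Kolassa describes above (resample each group separately, take the difference of the bootstrapped means, and compare the observed difference to that distribution); this is one reading of the comment, not code posted in the thread:

import numpy as np

rng = np.random.default_rng(0)
a = np.array(setA)  # setA and setB as defined in the question
b = np.array(setB)
observed_difference = b.mean() - a.mean()

num_samples = 10000
bootstrap_differences = np.empty(num_samples)
for i in range(num_samples):
    # Bootstrap each group separately, then take the difference of the bootstrapped means
    boot_a = rng.choice(a, size=len(a), replace=True)
    boot_b = rng.choice(b, size=len(b), replace=True)
    bootstrap_differences[i] = boot_b.mean() - boot_a.mean()

# Compare the observed difference to the bootstrap distribution, e.g. via a percentile CI
ci_low, ci_high = np.percentile(bootstrap_differences, [2.5, 97.5])
print(f"Observed difference in means: {observed_difference}")
print(f"95% bootstrap CI: [{ci_low:.5f}, {ci_high:.5f}]")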
