The Bootstrap
If you can construct a well-defined function that expresses the observed effect magnitude (taking into account all four sets of data) in clinically important terms (which you really ought to be able to do – otherwise, what is it you are measuring?), you can always resort to the bootstrap:
1. Draw four new sets of data by resampling, with replacement, from the ones you've observed.
2. Compute the effect magnitude of these simulated data sets.
3. Repeat the two steps above many times.
What you get out of this is an approximation to the sampling distribution of interest. From this distribution you can compute means, standard errors, confidence intervals, or whatever else you need.
Example with made-up data:
import math
import random
ct_before = [4, 5, 2, 5, 3, 4, 8]
ct_after = [5, 6, 1, 5, 3, 5, 7]
tx_before = [5, 3, 5, 3, 4, 4]
tx_after = [5, 8, 7, 6, 4, 6]
observations = (ct_before, ct_after, tx_before, tx_after)
# I don't know what a clinically meaningful effect magnitude
# is in your case, so I'm just taking the difference of the
# differences between the sums. You can come up with whatever
# you want here and the algorithm will work fine.
def effect_magnitude(ct_b, ct_a, tx_b, tx_a):
    ct_d = sum(ct_a) - sum(ct_b)
    tx_d = sum(tx_a) - sum(tx_b)
    return tx_d - ct_d
print(f'Observed effect magnitude: {effect_magnitude(*observations)}')
# Resample each of the four groups independently, with replacement.
def bootstrap_replication(ct_b, ct_a, tx_b, tx_a):
    return (
        random.choices(ct_b, k=len(ct_b)),
        random.choices(ct_a, k=len(ct_a)),
        random.choices(tx_b, k=len(tx_b)),
        random.choices(tx_a, k=len(tx_a))
    )

# Yield B bootstrap replicates of the effect magnitude.
def bootstrap_distribution(observations, B=5000):
    for _ in range(B):
        yield effect_magnitude(*bootstrap_replication(*observations))

distribution = sorted(bootstrap_distribution(observations))
B = len(distribution)
distr_sum = sum(distribution)
distr_sq_sum = sum(v**2 for v in distribution)
mu = distr_sum/B
se = math.sqrt((distr_sq_sum - distr_sum**2/B)/(B-1))
p05 = distribution[math.floor(B*0.05)]
p95 = distribution[math.ceil(B*0.95)]
print(f'Bootstrap mean effect magnitude: {mu:.2f}')
print(f'Bootstrap standard error: {se:.2f}')
print(f'Gaussian 90 % confidence interval based on bootstrap se: [{mu-1.645*se:.2f}, {mu+1.645*se:.2f}]')
print(f'Naïve bootstrap 90 % confidence interval: [{p05:.2f}, {p95:.2f}]')
This outputs
Observed effect magnitude: 11
Bootstrap mean effect magnitude: 10.92
Bootstrap standard error: 7.57
Gaussian 90 % confidence interval based on bootstrap se: [-1.53, 23.36]
Naïve bootstrap 90 % confidence interval: [-1.00, 23.00]
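Since the replicates are random, these numbers will shift slightly from run to run; call random.seed() with a fixed value at the top of the script if you want reproducible output. As a sanity check, the hand-rolled summaries can also be read off with the statistics module (Python 3.8+ for quantiles; its quantile convention differs slightly from the index-based one above). Appended to the script:

import statistics

# Cross-check of the hand-rolled summaries above.
print(f'Check mean: {statistics.mean(distribution):.2f}')
print(f'Check se: {statistics.stdev(distribution):.2f}')
q = statistics.quantiles(distribution, n=20)  # 19 cut points in 5 % steps
print(f'Check 90 % CI: [{q[0]:.2f}, {q[-1]:.2f}]')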
Theoretical approach
If, for whatever reason, you want a more theoretical approach, you still need to start from a function that expresses the observed effect magnitude. Assuming we stick with the one I used in the previous example, we see that it's effectively a sum of four sums.
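Spelled out with the group sums, the effect magnitude computed by effect_magnitude above is

$$E = \left(\sum_i \mathrm{tx}_{a,i} - \sum_i \mathrm{tx}_{b,i}\right) - \left(\sum_j \mathrm{ct}_{a,j} - \sum_j \mathrm{ct}_{b,j}\right)$$

i.e. four sums, each over one group, combined with signs.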
For each group, we get the sample variance:
$$s_{c,b}^2 = 3.62\;\;\;\;\;\;\;\;s_{c,a}^2 = 3.95$$
$$s_{t,b}^2 = 0.80\;\;\;\;\;\;\;\;s_{t,a}^2 = 2.00$$
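These are plain $n-1$ sample variances of the made-up lists above; for example, with the standard library:

import statistics

ct_before = [4, 5, 2, 5, 3, 4, 8]
ct_after = [5, 6, 1, 5, 3, 5, 7]
tx_before = [5, 3, 5, 3, 4, 4]
tx_after = [5, 8, 7, 6, 4, 6]

# statistics.variance uses the n-1 denominator, matching the values above.
for name, group in [('s2_cb', ct_before), ('s2_ca', ct_after),
                    ('s2_tb', tx_before), ('s2_ta', tx_after)]:
    print(f'{name} = {statistics.variance(group):.2f}')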
If the $n$ observations in a group are independent with sample variance $s^2$, the variance of their sum is $n s^2$ (not $s\sqrt{n}$, which is the standard error of the sum). The four sums are independent of each other, so the variance of the effect size is those variances summed, each weighted by its group size ($n = 7$ for the control group, $n = 6$ for the treatment group):

$$s_e^2 = 7\,(3.62 + 3.95) + 6\,(0.80 + 2.00) = 69.80$$

The standard error of the effect size is, then, $\sqrt{69.80} \approx 8.35$.

This gives you an observed effect magnitude of $11$ with a theoretical standard error of $8.35$, giving you a Gaussian 90 % confidence interval of $\left[-2.74, 24.74\right]$.
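As a quick check of that arithmetic, here is a minimal sketch using only the standard library and the made-up data from the example:

import math
import statistics

ct_before = [4, 5, 2, 5, 3, 4, 8]
ct_after = [5, 6, 1, 5, 3, 5, 7]
tx_before = [5, 3, 5, 3, 4, 4]
tx_after = [5, 8, 7, 6, 4, 6]

# Var(sum of a group) = n * s^2; the four independent variances add up.
var_effect = sum(len(g) * statistics.variance(g)
                 for g in (ct_before, ct_after, tx_before, tx_after))
se = math.sqrt(var_effect)
effect = (sum(tx_after) - sum(tx_before)) - (sum(ct_after) - sum(ct_before))
print(f'Theoretical standard error: {se:.2f}')  # 8.35
print(f'Gaussian 90 % CI: [{effect - 1.645*se:.2f}, {effect + 1.645*se:.2f}]')  # [-2.74, 24.74]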
This interval is somewhat wider than the naïve bootstrap one. That is not surprising: the bootstrap resamples from the empirical distribution, whose plug-in variance carries a divisor of $n$ rather than $n-1$, so it understates the spread a little, and I deliberately picked data that sit just about at the edge of significance. Overall, the rough sizes of the intervals are the same, which is a good sign. I would report both if I tried both approaches.