4

The question is simple. As with most online A/B tests, we are more interested in delta% than in delta, for example:

delta = (mean of treatment - mean of control)
delta% = (mean of treatment - mean of control)/mean of control. 

Say we use a two-sample t-test with delta as the test statistic, and we find that delta is significant. Can we then say that delta% is significant?

  1. If yes, why?
  2. If not, and we have to use delta% as the test statistic instead, how do we determine its distribution (assume the sample size is large)?
yabchexu
  • 177
  • You need to distinguish between statistical significant and practical significance. For example, if you have 1+ million samples, you could show the treated groups spends a statistical significant 2 extra seconds viewing the web page, but if the average time per page is 5 minute, is that extra time an improvement? (a practical improvement? this is business decision and not a statistics question) – Dave2e Oct 02 '20 at 13:23
  • @Dave2e Your point about the distinction between economic and substantive significance is very relevant. But it is certainly possible to cook up an example where only one of the two passes at $\alpha = 5\%$. After all, $X-Y$ and $(X-Y)/Y$ are different random variables, with different distributions. – dimitriy Oct 03 '20 at 22:03
  • @Dave2e You are right, but you misunderstand my question. I was asking whether, in order to know if delta% is significant, we could use delta as the test statistic; if not, we have to use delta% as the statistic in the hypothesis test. The question is not about practical significance. – yabchexu Oct 04 '20 at 01:28
  • @DimitriyV.Masterov That is what I am asking. For delta% we usually have a way to estimate the variance, but it is difficult for me to determine its distribution. So usually I use delta as the test statistic to decide whether delta% is significant. I am not sure if this is rigorous. – yabchexu Oct 05 '20 at 18:42

2 Answers

6

The absolute (delta) and relative (delta%) changes are different random variables, so you should try to calculate the ratio-based standard errors, CIs, and p-values if you care about the latter. This will not change your decision most of the time, but you will come across examples where it does matter (wider CIs, higher p-values, etc.). Ratios can be tricky like that.

Here's a toy example to make things clearer. Consider a two-sample test with a binary outcome where $N_T=N_C=1,359$. There are 163 successes in treatment and 136 in control. The p-value on the absolute difference is 0.098, so you would reject the null that the two groups are the same at $\alpha=.10$. However, the p-value on the relative difference is 0.101, so you would fail to reject. In some sense, this is an artifact of using a fixed significance threshold and of the approximation inherent in the delta method, but it can lead to different decisions with the same data and decision rule under different definitions of the difference.
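The toy example's two p-values can be reproduced in spirit with a few lines of Python (a sketch using numpy/scipy, not the code behind the answer; the relative-difference p-value depends on which variance approximation you pick, so the exact number need not match the one quoted above):

```python
import numpy as np
from scipy import stats

# Toy data from the example: equal groups of n = 1,359,
# with 163 successes in treatment and 136 in control.
n = 1359
p_t, p_c = 163 / n, 136 / n
var_t = p_t * (1 - p_t) / n          # variance of the treatment proportion
var_c = p_c * (1 - p_c) / n          # variance of the control proportion

# Absolute difference: z-test with unpooled variances.
delta = p_t - p_c
p_abs = 2 * stats.norm.sf(abs(delta) / np.sqrt(var_t + var_c))

# Relative difference: delta-method SE for (p_t - p_c)/p_c,
# treating the two samples as independent (zero covariance).
delta_pct = delta / p_c
se_pct = np.sqrt(var_t / p_c**2 + p_t**2 * var_c / p_c**4)
p_rel = 2 * stats.norm.sf(abs(delta_pct) / se_pct)

# p_abs comes out near 0.098, and p_rel is noticeably larger,
# so the relative test is the weaker one on this data.
```

With this particular variance approximation the relative-difference p-value lands even further from 0.098 than the quoted 0.101, which only reinforces the point: the choice of estimator for the ratio's variance matters.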

Now on to your second question. There are many ways to calculate the variance, with varying complexity. It depends on what tools you have access to, features of your data and experiments, and your company's level of statistical sophistication.

These methods are:

  1. Delta method (either with correlated means or uncorrelated means)
  2. Fieller's method (with correlated or uncorrelated means or regression version)
  3. Regression (either transforming the outcome, or transforming the coefficients or using a GLM and then using the delta method or Fieller's method)
  4. Bootstrapping (relative difference itself or regression), permutation tests
  5. Some combinations of the above, like bootstrapped GLM regression

If you are willing to assume that the two means are uncorrelated (which usually makes sense in an A/B test), there are simple formulas linked above you can use (either delta method or based on Fieller's method). There are also canned commands/packages/online calculators.

If you are not willing to assume that, you can use regression, since that returns the covariances pretty easily. Then you can either use the more complicated formulas that include the covariance term or have a stats package handle that for you. Another option is to log the outcome or use a GLM to get the effects in percent terms.
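As an illustration of the regression route (a Python sketch on synthetic data, not code from the answer), you can fit the two-group OLS, pull the full coefficient covariance matrix, and plug it into the delta-method variance of the ratio of coefficients, covariance term included:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
treat = np.repeat([0, 1], n)                      # group dummy
y = 10 + 0.5 * treat + rng.normal(0, 2, 2 * n)    # synthetic outcome

# OLS by hand: design matrix is [intercept, dummy].
X = np.column_stack([np.ones(2 * n), treat])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)      # (control mean, delta)
resid = y - X @ beta
s2 = resid @ resid / (2 * n - 2)                  # residual variance
V = s2 * np.linalg.inv(X.T @ X)                   # coefficient covariance

a, b = beta                                       # a = control mean, b = delta
var_a, var_b, cov_ab = V[0, 0], V[1, 1], V[0, 1]

# Delta-method SE of the relative effect b/a, keeping the covariance term.
ratio = b / a
se_ratio = np.sqrt(var_b / a**2 + b**2 * var_a / a**4 - 2 * b * cov_ab / a**3)
```

In the dummy-variable parameterization the intercept is the control mean and the slope is delta, so `ratio` is exactly delta%.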

Personally, I find some version of regression easiest, and that still works even if there is no correlation since in that case the covariance will be close to zero.

You can also bootstrap easily, either the relative change itself or regression coefficients. There is no formula here since this is a resampling method. Make sure to set the seed so that you can replicate your work each time.
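A minimal version of that bootstrap in Python (hypothetical data; percentile intervals shown here, though normal-based intervals as in the Stata output below are equally common):

```python
import numpy as np

rng = np.random.default_rng(10122020)             # fixed seed for replicability
control = rng.normal(100, 15, 500)                # hypothetical control outcomes
treatment = rng.normal(104, 15, 500)              # hypothetical treatment outcomes

reps = 2000
boot = np.empty(reps)
for r in range(reps):
    c = rng.choice(control, control.size, replace=True)
    t = rng.choice(treatment, treatment.size, replace=True)
    boot[r] = (t.mean() - c.mean()) / c.mean()    # resampled delta%

ci_lo, ci_hi = np.percentile(boot, [2.5, 97.5])   # 95% percentile interval
```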

None of these approaches is exact; they are all approximations. In the toy example below, they align pretty closely.

For example, the delta method formula for the standard error of the relative change is

$$SE \left( \frac{B-A}{A} \right) \approx \sqrt{\frac{Var(B)\cdot A^2 - 2 \cdot Cov(A,B)\cdot A \cdot B + Var(A)\cdot B^2}{A^4}},$$

where $A$ is the mean in the control group and $B$ is the mean in the treatment group. Assuming uncorrelated means sets the covariance term to zero, simplifying the formula. Otherwise, regression is the easiest way to get the covariance between the two means.
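For concreteness, here is that uncorrelated-means calculation as a small Python function, applied to the blood-pressure summary statistics used in the Stata session below (a sketch; only the group means, SDs, and sizes are needed):

```python
import math

def se_relative_change(mean_c, sd_c, n_c, mean_t, sd_t, n_t):
    """Delta-method SE of (mean_t - mean_c)/mean_c for independent samples."""
    var_c = sd_c**2 / n_c            # variance of the control-group mean
    var_t = sd_t**2 / n_t            # variance of the treatment-group mean
    return math.sqrt(var_t / mean_c**2 + mean_t**2 * var_c / mean_c**4)

# Men as the "control" (denominator) group, women as "treatment":
se = se_relative_change(155.5167, 15.24322, 60, 147.2, 11.74272, 60)
# se lands close to the regression-based and bootstrap SEs reported below
```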

Below I will compare blood pressure between men and women (analogous to a treated and control group) using Stata. I annotated the Stata code with some brief explanations.

You can find some regression-based examples of Stata and R code in Lye, J., & Hirschberg, J. (2018). Ratios of Parameters: Some Econometric Examples. Australian Economic Review, 51(4), 578–602. doi:10.1111/1467-8462.12300.

In this dataset, women have 5% lower BP relative to men, which is about 8 mmHg in absolute terms. All of the relative difference CIs are roughly in the [-8%,-2%] range:

. sysuse bplong, clear
(fictional blood-pressure data)

. keep if when=="After":when
(120 observations deleted)

. isid patient

. /* summary stats */
. table sex, c(mean bp semean bp sd bp N bp)


      Sex |   mean(bp)     sem(bp)      sd(bp)       N(bp)
----------+-----------------------------------------------
     Male |   155.5167    1.967891    15.24322          60
   Female |      147.2    1.515979    11.74272          60


. label list sex
sex:
           0 Male
           1 Female

. set seed 10122020

. /* (A) Absolute effect */
. ttest bp, by(sex) reverse

Two-sample t test with equal variances

   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
  Female |      60       147.2    1.515979    11.74272    144.1665    150.2335
    Male |      60    155.5167    1.967891    15.24322    151.5789    159.4544
---------+--------------------------------------------------------------------
combined |     120    151.3583    1.294234    14.17762    148.7956     153.921
---------+--------------------------------------------------------------------
    diff |           -8.316667    2.484107               -13.23587   -3.397459


diff = mean(Female) - mean(Male)                              t =  -3.3480

Ho: diff = 0                                     degrees of freedom =      118

Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0

Pr(T < t) = 0.0005 Pr(|T| > |t|) = 0.0011 Pr(T > t) = 0.9995

. regress bp i.sex

  Source |       SS           df       MS      Number of obs   =       120

-------------+----------------------------------   F(1, 118)       =     11.21
       Model |  2075.00833         1  2075.00833   Prob > F        =    0.0011
    Residual |  21844.5833       118  185.123588   R-squared       =    0.0867
-------------+----------------------------------   Adj R-squared   =    0.0790
       Total |  23919.5917       119  201.004972   Root MSE        =    13.606


          bp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

-------------+----------------------------------------------------------------
         sex |
      Female |  -8.316667   2.484107    -3.35   0.001    -13.23587   -3.397459
       _cons |   155.5167   1.756529    88.54   0.000     152.0383    158.9951


. /* (B) Relative effect */

. /* (-1) logged outcome t-test (works for strictly positive data and small relative differences) */
. generate ln_bp = ln(bp)

. ttest ln_bp, by(sex) reverse

Two-sample t test with equal variances

   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
  Female |      60    4.988759    .0100682    .0779881    4.968612    5.008905
    Male |      60    5.041963    .0127957    .0991153    5.016358    5.067567
---------+--------------------------------------------------------------------
combined |     120    5.015361    .0084655     .092735    4.998598    5.032123
---------+--------------------------------------------------------------------
    diff |            -.053204    .0162819               -.0854466   -.0209615


diff = mean(Female) - mean(Male)                              t =  -3.2677

Ho: diff = 0                                     degrees of freedom =      118

Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0

Pr(T < t) = 0.0007 Pr(|T| > |t|) = 0.0014 Pr(T > t) = 0.9993

. /* (0) bootstrap means */
. capture program drop mybs

. program define mybs, rclass
  1.     quietly summarize bp if sex=="Female":sex
  2.     scalar female_avg_bp = r(mean)
  3.     quietly summarize bp if sex=="Male":sex
  4.     scalar male_avg_bp = r(mean)
  5.     return scalar ratio = (female_avg_bp - male_avg_bp)/male_avg_bp
  6. end

. bootstrap ratio = r(ratio), reps(500) nodots nowarn: mybs

Bootstrap results                               Number of obs     =        120
                                                Replications      =        500

  command:  mybs
    ratio:  r(ratio)


             |   Observed   Bootstrap                         Normal-based
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       ratio |  -.0534777   .0153194    -3.49   0.000    -.0835031   -.0234522


. /* (1b) delta method using regression and ratio of predictions by hand */
. regress bp i.sex

  Source |       SS           df       MS      Number of obs   =       120

-------------+----------------------------------   F(1, 118)       =     11.21
       Model |  2075.00833         1  2075.00833   Prob > F        =    0.0011
    Residual |  21844.5833       118  185.123588   R-squared       =    0.0867
-------------+----------------------------------   Adj R-squared   =    0.0790
       Total |  23919.5917       119  201.004972   Root MSE        =    13.606


          bp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

-------------+----------------------------------------------------------------
         sex |
      Female |  -8.316667   2.484107    -3.35   0.001    -13.23587   -3.397459
       _cons |   155.5167   1.756529    88.54   0.000     152.0383    158.9951


. nlcom ratio:(_b[1.sex])/_b[_cons]

   ratio:  (_b[1.sex])/_b[_cons]


          bp |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       ratio |  -.0534777    .015552    -3.44   0.001     -.083959   -.0229963


. margins, eydx(sex) // another way: calculate the elasticity

Conditional marginal effects                    Number of obs     =        120
Model VCE    : OLS

Expression   : Linear prediction, predict()
ey/dx w.r.t. : 1.sex


             |            Delta-method
             |      ey/dx   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         sex |
      Female |  -.0549607   .0164307    -3.35   0.001    -.0874979   -.0224235


Note: ey/dx for factor levels is the discrete change from the base level.

. /* (2a) logged outcome regression */
. /* works for strictly positive data and small relative differences */
. regress ln_bp i.sex

  Source |       SS           df       MS      Number of obs   =       120

-------------+----------------------------------   F(1, 118)       =     10.68
       Model |  .084920032         1  .084920032   Prob > F        =    0.0014
    Residual |  .938452791       118   .00795299   R-squared       =    0.0830
-------------+----------------------------------   Adj R-squared   =    0.0752
       Total |  1.02337282       119  .008599772   Root MSE        =    .08918


       ln_bp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

-------------+----------------------------------------------------------------
         sex |
      Female |   -.053204   .0162819    -3.27   0.001    -.0854466   -.0209615
       _cons |   5.041963    .011513   437.94   0.000     5.019164    5.064762


. /* (2b) GLM with exponentiated coefficients */
. glm bp i.sex, family(gaussian) link(log) nolog

Generalized linear models                       Number of obs     =        120
Optimization     : ML                           Residual df       =        118
                                                Scale parameter   =   185.1236
Deviance         = 21844.58333                  (1/df) Deviance   =   185.1236
Pearson          = 21844.58333                  (1/df) Pearson    =   185.1236

Variance function: V(u) = 1                     [Gaussian]
Link function    : g(u) = ln(u)                 [Log]

                                              AIC             =   8.075427

Log likelihood   = -482.5256155                 BIC               =   21279.66


             |                 OIM
          bp |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         sex |
      Female |  -.0549607   .0164307    -3.35   0.001    -.0871643   -.0227571
       _cons |   5.046753   .0112948   446.82   0.000     5.024616     5.06889


. /* (3) bootstrap ratio of predictions from regression by hand */
. bootstrap ratio = (_b[1.sex]/_b[_cons]), reps(500) nodots: regress bp i.sex

Linear regression                               Number of obs     =        120
                                                Replications      =        500

  command:  regress bp i.sex
    ratio:  _b[1.sex]/_b[_cons]


             |   Observed   Bootstrap                         Normal-based
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       ratio |  -.0534777   .0147228    -3.63   0.000    -.0823338   -.0246215


. /* (4) Fieller's method (uncorrelated means) */
. /* there is also a correlated means version */
. fieller bp, by(sex) reverse

Confidence Interval for a Quotient by Fieller's Method (Unpaired Data)

Numerator Mean:    147.2
Denominator Mean:  155.51667
Quotient:          .94652234
95% CI:            .91652092 – .97771318

. /* (5) delta method by hand (uncorrelated means) */
. /* there is also a correlated means version */
. table sex, c(mean bp sd bp N bp)


      Sex |   mean(bp)      sd(bp)       N(bp)
----------+-----------------------------------
     Male |   155.5167    15.24322          60
   Female |      147.2    11.74272          60


. display "SE(ratio) = " sqrt(((11.74272^2/60)*(155.5167)^2 + (15.24322^2/60)*(147.2)^2)/(155.5167^4))
SE(ratio) = .0154427

The Fieller method above calculates $\frac{\bar Y_{female}}{\bar Y_{male}}$ rather than the relative change, but the two are equivalent since the relative change is just this ratio minus one. The paper cited above has R and Stata code to calculate the relative change with regression.
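If you don't have the `fieller` command, the underlying calculation is short enough to do directly. This Python sketch solves the usual Fieller quadratic for independent means (it is not that command's actual code, but it reproduces the interval above from the same summary statistics):

```python
import math
from scipy import stats

def fieller_ci(num_mean, num_se, den_mean, den_se, df, level=0.95):
    """Fieller CI for the ratio of two independent means: the roots rho of
    (num - rho*den)^2 = t^2 * (num_se^2 + rho^2 * den_se^2)."""
    t2 = stats.t.ppf(1 - (1 - level) / 2, df) ** 2
    a = den_mean**2 - t2 * den_se**2     # assumes a > 0, i.e. the denominator
    b = -2 * num_mean * den_mean         # mean is clearly bounded away from 0
    c = num_mean**2 - t2 * num_se**2
    disc = math.sqrt(b**2 - 4 * a * c)
    return (-b - disc) / (2 * a), (-b + disc) / (2 * a)

# Female mean over male mean, with the standard errors from the summary table:
lo, hi = fieller_ci(147.2, 1.515979, 155.5167, 1.967891, df=118)
# lo, hi come out near 0.9165 and 0.9777, matching the fieller output above
```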


Here is some code showing that the p-values can also differ depending on whether absolute or relative change is used with Wald and Wald-type tests:

. sysuse bplong, clear
(fictional blood-pressure data)

. keep if when=="After":when
(120 observations deleted)

. estimates clear

. qui regress bp i.sex

. /* Absolute effect Wald-type test */
. testnl _b[1.sex] = 0

(1) _b[1.sex] = 0

           chi2(1) =       11.21
       Prob > chi2 =        0.0008

. display r(p)
.00081412

. /* Relative effect Wald-type test */
. testnl _b[1.sex]/_b[_cons] = 0

(1) _b[1.sex]/_b[_cons] = 0

           chi2(1) =       11.82
       Prob > chi2 =        0.0006

. di r(p)
.00058466

. /* Absolute effect Wald test */
. test _b[1.sex] = 0

( 1) 1.sex = 0

   F(  1,   118) =   11.21
        Prob > F =    0.0011

. display r(p)
.00109302

. /* Relative effect Wald test */
. margins, eydx(sex) post

Conditional marginal effects                    Number of obs     =        120
Model VCE    : OLS

Expression   : Linear prediction, predict()
ey/dx w.r.t. : 1.sex


             |            Delta-method
             |      ey/dx   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         sex |
      Female |  -.0549607   .0164307    -3.35   0.001    -.0874979   -.0224235


Note: ey/dx for factor levels is the discrete change from the base level.

. test _b[1.sex] = 0

( 1) 1.sex = 0

   F(  1,   118) =   11.19
        Prob > F =    0.0011

. di r(p)
.00110368

dimitriy
  • 35,430
  • Hi Dimitriy, thanks for this comprehensive explanation; it helps me a lot. I know how to estimate the variance of the ratio using the methods you list above, but the variance estimate alone is not enough: we need to know the distribution of the statistic in order to calculate the p-value with a parametric test. I am not familiar with the Wald test above. Taking your example of "a two-sample test with a binary outcome", which test did you use and how did you calculate the p-value? – yabchexu Oct 17 '20 at 06:33
  • I used regression to do that part, with the delta method for the relative difference variance. You can also bootstrap to get the distribution of the statistic. Fieller's method could be used iteratively to get a p-value. I suppose permutation tests are another option to get the distribution under the null. – dimitriy Oct 17 '20 at 06:40
  • Thanks. The nonparametric methods are easy since we don't assume any distribution; it was the parametric case that confused me. Thanks a lot. – yabchexu Oct 17 '20 at 08:30
3

Equivalent null hypotheses

Yes

When you ask about significance, you are in the realm of hypothesis testing.

For your situation, the hypotheses $H_0: \Delta = 0$ and $H_0: \Delta\% = 0$ are equivalent if you assume that the mean of control is non-zero (and if you do not assume that, then $\Delta\%$ is problematic to define because of the potential division by zero).

So if $\Delta$ is significantly different from $0$ then you can also claim that $\Delta\%$ is significantly different from $0$.

Distribution of the data, not the parameter estimate

Also, note that testing is normally not performed by observing only $\Delta$ or only $\Delta\%$. In that case, you would have a problematic situation with unknown nuisance parameters.

Instead, you use some test statistic based on the observations of treatment and the observations of control, e.g. their means $\mu_{treatment}$ and $\mu_{control}$. The test procedure would be the same for the hypotheses $H_0: \Delta = 0$ and $H_0: \Delta\% = 0$ because you do not base it on the sample distributions of $\Delta$ or $\Delta\%$, but instead on the joint sample distribution of $\mu_{treatment}$ and $\mu_{control}$. You base a significance test on the data.

You might think that the significance test is different because estimates of $\Delta$ and $\Delta\%$ have different sample distributions. However, the parameter estimate and its sample distribution are not necessarily the statistic that is used for the significance test.

For instance, when we perform linear regression, we might estimate a parameter and perform a t-test, which relates to the sample distribution of that parameter. But we could also perform an analysis of variance and an F-test, which doesn't care how you express the parameter.
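That equivalence is easy to verify numerically; in Python with synthetic two-group data (a sketch, not tied to the examples above), the equal-variance t-test and the one-way ANOVA give $F = t^2$ and identical p-values:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
a = rng.normal(0.0, 1.0, 50)          # "control" sample
b = rng.normal(0.5, 1.0, 50)          # "treatment" sample

t, p_t = stats.ttest_ind(a, b)        # equal-variance two-sample t-test
F, p_F = stats.f_oneway(a, b)         # one-way ANOVA on the same two groups

# F equals t**2 and p_t equals p_F: same test, different statistic.
```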

Significance testing is not primarily about the distribution of the estimate of some statistic (it can be used in testing, but it is derivative and not a first principle). Instead, it is in the first place about the distribution of the data conditional on the null hypothesis being true. The sample distribution of the data is the same for both null hypotheses. Therefore if an observation is significant for the one hypothesis, then it is also significant for the other.

Significance means that you made an extreme observation given the null hypothesis.


Confidence intervals

Where the use of $\Delta$ and $\Delta\%$ may differ is in the expression of confidence intervals. In this case, we are no longer talking about the null hypothesis $H_0: \Delta = 0$ that sets the parameter to zero. Instead we consider the range of parameters $\theta$ for which the hypothesis $H_0: \Delta = \theta$ passes a significance test (passing means no significance). The hypotheses $H_0: \Delta = \theta$ and $H_0: \Delta\% = \theta$, with $\theta \neq 0$, are not equivalent.

  • Does the p-value invariance hold in a regression setting? I added an example above where the p-values do seem to vary slightly depending on whether relative or absolute change is used. Or am I missing something? – dimitriy Oct 15 '20 at 20:29
  • 1
    The p-values could definitely differ between the two because as you said, they are different random variables with different sampling distributions under $H_0$. The point is that any test that rejects one $H_0$ rejects the other. One test may be more powerful, though. It's not that uncommon for multiple tests to test the same null hypothesis in different ways with different test statistics and p-values, like Bartlett's test and Levene's test for equality of variances. – Noah Oct 15 '20 at 20:40
  • @DimitriyV.Masterov I agree with Noah, the p-values can differ for different tests. But when we consider $H_0:\Delta$ or $H_0:\Delta%$ we are not necessarily considering different tests. The parameter estimates may be different statistics with different sample distributions, but they are not typically used to do the inference (in the same way in linear regression: you could consider the sample distribution of a coefficient, and perform a t-test, but for a more optimal test you should consider a F-test, which considers the residuals/data, and doesn't care how you express the coefficient). – Sextus Empiricus Oct 15 '20 at 21:03
  • @DimitriyV.Masterov I am not sure why your examples relate to different null hypotheses. You consider different variables, like the logarithm, but why does the one relate more or less than the other to a hypothesis like $H_0:\Delta=0$ or $H_0:\Delta%=0$? – Sextus Empiricus Oct 15 '20 at 21:11
  • Significance testing is not about the distribution of the estimate of some statistic (it can be used in testing, but it is derivative), but it is in the first place about the distribution of the data conditional on the null hypothesis being true. The sample distribution is the same for both null hypotheses. Therefore if an observation is significant for the one hypothesis, then it is also significant for the other. – Sextus Empiricus Oct 15 '20 at 21:16
  • The example at the bottom of my post does not involve logarithms. I have two types of tests for the absolute difference and the relative difference, with different p-values. – dimitriy Oct 15 '20 at 21:26
  • @DimitriyV.Masterov with those two Wald tests you are not testing different hypotheses, you are just using different statistics (the estimates $\hat\Delta$ vs $\hat\Delta\%$ and estimates of their sample distributions) to perform the significance test. The tests are indeed different, but that does not mean that the related hypotheses are different. – Sextus Empiricus Oct 15 '20 at 21:44