
I have been following this methodology to implement Bayesian A/B testing with Python on a new search-engine feature that helps users find products more accurately. Specifically, I split my users across three groups:

  • control (A)
  • disabled (B)
  • enabled (C)

Users come from different age ranges, genders, and countries, so they are randomly assigned to these groups to minimise the variance related to those differences.

Here are some counts:

variant   Gender             count
control   MALE             1135282
          FEMALE            869479
          NOT AVAILABLE      28738
disabled  MALE             1118310
          FEMALE            870708
          NOT AVAILABLE      31323
enabled   MALE             1122299
          FEMALE            867738
          NOT AVAILABLE      29061

variant   ageBin             count
control   senior            837895
          Teenager          828989
          elder             593886
          adult             558086
disabled  senior            835150
          Teenager          832902
          elder             579681
          adult             558546
enabled   Teenager          836968
          senior            828841
          elder             588277
          adult             557469

Now, the control and disabled variants give users the same experience; I included both because I wanted a way to be confident in my A/B/C statistical validation: my theory was that A = B, so there should be no difference (neither improvement nor loss) between them. The A/B test has been running for two weeks and I have:

             sampleSize  metrics  conversionRate
variant                                      
control       249660    53724        0.215189
disabled      248436    54236        0.218310
enabled       248091    55043        0.221866

At this sample size the prior does not seem to have much influence, but I am using a Beta prior. When comparing A vs B, B vs C and A vs C, I found that:

 Chance of B beating A is 0.9962
 Chance of C beating A is 1
 Chance of C beating B is 0.9867
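
For reference, this is roughly how I compute those probabilities, by sampling from the Beta posteriors; the flat Beta(1, 1) prior and the number of draws below are illustrative placeholders rather than my exact settings:

import numpy as np

rng = np.random.default_rng(0)

# conversions and sample sizes from the table above
data = {
    "control":  (53724, 249660),
    "disabled": (54236, 248436),
    "enabled":  (55043, 248091),
}

# Posterior for each variant: Beta(a + conversions, b + non-conversions),
# starting from a flat Beta(1, 1) prior (placeholder choice).
def draw_posterior(conversions, n, a=1.0, b=1.0, size=200_000):
    return rng.beta(a + conversions, b + n - conversions, size)

post = {name: draw_posterior(c, n) for name, (c, n) in data.items()}

# "Chance of X beating Y" = share of posterior draws where X's rate exceeds Y's
print("P(disabled > control):", (post["disabled"] > post["control"]).mean())
print("P(enabled > control): ", (post["enabled"] > post["control"]).mean())
print("P(enabled > disabled):", (post["enabled"] > post["disabled"]).mean())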

I don't feel like I can trust any of those results, because B is "beating" A where I would have expected the model not to be able to tell them apart (a probability below 90% at least), so I must have misunderstood something.

Any help or explanation would be greatly appreciated.

mb_SW
  • A potential sense check: Is your "disabled" group truly equal to your control group, in the sense that the two do not go through different computation steps (e.g. a client-side redirect) to get what you believe is the same user experience? Sections 8 and 9 of this paper explain potential problems arising on this front in more detail. – B.Liu Mar 14 '22 at 08:55

1 Answer


I assume that you use the independence assumption, but with your sample sizes I tend to distrust it! There must be some subgroups in your data; you did not give many details, but maybe country, age, or something else. The conversion rate may vary across such subgroups, and if the distribution of these variables differs between your three treatment groups, that contributes to the differences you have observed. The standard errors computed from the binomial distribution will then be too small (unobserved heterogeneity). Some calculations with your data:

control  <-  0.215189
disabled <-  0.218310
enabled  <-  0.221866

> disabled - control
[1] 0.003121
> enabled - control
[1] 0.006677
> (enabled - control) / (disabled - control)
[1] 2.139378

so your disabled group sits almost at the midpoint between control and enabled.
One idea could be to treat control and disabled as two control groups; the difference between them is then really caused by the variance (including non-binomial variance), and that could be a basis for an alternative analysis. For now, here is a stored Google Scholar search for papers about analysis with two control groups. I will look into it ... but I am out of time now.

EDIT (after the question was updated with more information): Since there are additional variables like age range, gender and country, these can be controlled for, which will help to control for and estimate the (binomial) overdispersion. One way to do it is a mixed-effects logistic regression, which is discussed in many posts on this site, for instance.
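
In Python (which the question uses), a minimal sketch of such a model might look like the following; this is only an illustration with synthetic stand-in data and hypothetical column names, using statsmodels' Bayesian binomial mixed GLM with a random intercept per country:

import numpy as np
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

# Synthetic stand-in data just to make the sketch runnable; in practice this
# would be the real per-user table with these (hypothetical) column names.
rng = np.random.default_rng(0)
n = 5_000
df = pd.DataFrame({
    "variant": rng.choice(["control", "disabled", "enabled"], n),
    "ageBin":  rng.choice(["Teenager", "adult", "senior", "elder"], n),
    "gender":  rng.choice(["MALE", "FEMALE", "NOT AVAILABLE"], n),
    "country": rng.choice([f"c{i}" for i in range(10)], n),
})
df["converted"] = rng.binomial(1, 0.22, n)

# Fixed effects for variant and the observed covariates, plus a random
# intercept per country to absorb country-level heterogeneity (one source
# of overdispersion relative to the plain binomial model).
vc_formulas = {"country": "0 + C(country)"}
model = BinomialBayesMixedGLM.from_formula(
    "converted ~ C(variant) + C(ageBin) + C(gender)", vc_formulas, df)
result = model.fit_vb()   # variational Bayes fit
print(result.summary())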

  • Thanks for the quick reply. Yes, there are a lot of parameters (age, country, gender) creating potential subgroups, but since the A/B/C groups are randomly sampled I was kind of hoping the variance between the subgroups would be minimal; otherwise, how can we ever be sure that the difference seen between B/C is due to the tested feature and not to those subgroup parameters? I will look whether there is an answer in the papers you sent. Thanks! – mb_SW Mar 11 '22 at 07:00
  • Please add this new information to the post as an edit; we want to keep all information in the post itself and not spread over comments! That way more people will see your post, so it can help you. Otherwise, I think a major problem in this case is overdispersion (relative to the binomial), and including other covariates will help with that, making it possible to estimate the overdispersion. – kjetil b halvorsen Mar 11 '22 at 13:26
  • (+1) In general, using the additional information is almost always beneficial, because if anything it allows us to "explain out some of the variance" due to these additional explanatory features. While small at times, this can also highlight interactions we didn't originally anticipate. – usεr11852 Apr 07 '22 at 09:41