I have been following this methodology to implement Bayesian A/B testing in Python for a new search engine feature that helps users find products more accurately. More specifically, I split my users across three groups:
- control (A)
- disabled (B)
- enabled (C)
Users come from different age ranges, genders, and countries, so they are randomly sampled into these groups to minimise the variance related to those differences.
Here are some counts:
variant    Gender         count
control    MALE           1135282
           FEMALE         869479
           NOT AVAILABLE  28738
disabled   MALE           1118310
           FEMALE         870708
           NOT AVAILABLE  31323
enabled    MALE           1122299
           FEMALE         867738
           NOT AVAILABLE  29061
variant    ageBin    count
control    senior    837895
           Teenager  828989
           elder     593886
           adult     558086
disabled   senior    835150
           Teenager  832902
           elder     579681
           adult     558546
enabled    Teenager  836968
           senior    828841
           elder     588277
           adult     557469
Now, the control and disabled variants are identical; I wanted a way to be confident in my A/B/C statistical validation, my theory being that A = B, so no difference (improvement or loss) should be detected between them. The test has been running for two weeks and I have (metrics = number of conversions):
variant    sampleSize  metrics  conversionRate
control    249660      53724    0.215189
disabled   248436      54236    0.218310
enabled    248091      55043    0.221866
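
As a cross-check on whether the control vs disabled difference is even detectable outside the Bayesian setup, a simple frequentist two-proportion z-test could be run on those counts, roughly like this (using statsmodels; not something my methodology prescribes, just a sketch):

```python
from statsmodels.stats.proportion import proportions_ztest

# Conversions and sample sizes for control (A) and disabled (B) from the table above.
conversions = [53724, 54236]
samples = [249660, 248436]

# Two-sided test of H0: the two conversion rates are equal.
z_stat, p_value = proportions_ztest(count=conversions, nobs=samples, alternative="two-sided")
print(f"z = {z_stat:.3f}, p = {p_value:.4f}")
```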
At this sample size the prior does not seem to have much influence, but I am using a Beta prior. Comparing the pairs A vs B, A vs C, and B vs C, I found:
Chance of B beating A is 0.9962
Chance of C beating A is 1
Chance of C beating B is 0.9867
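
For context, these numbers come from comparing the Beta posteriors of each pair of variants; my computation is roughly equivalent to the sketch below (the flat Beta(1,1) prior and the Monte Carlo draw count are placeholders here, not necessarily what I actually use):

```python
import numpy as np

rng = np.random.default_rng(0)

# (conversions, sample size) per variant, taken from the table above.
data = {
    "control":  (53724, 249660),
    "disabled": (54236, 248436),
    "enabled":  (55043, 248091),
}

def prob_beats(variant_a, variant_b, n_draws=1_000_000, prior=(1, 1)):
    """P(conversion rate of variant_a > conversion rate of variant_b) under Beta posteriors."""
    conv_a, n_a = data[variant_a]
    conv_b, n_b = data[variant_b]
    # Posterior is Beta(prior_alpha + conversions, prior_beta + non-conversions).
    post_a = rng.beta(prior[0] + conv_a, prior[1] + n_a - conv_a, n_draws)
    post_b = rng.beta(prior[0] + conv_b, prior[1] + n_b - conv_b, n_draws)
    return (post_a > post_b).mean()

print("P(B > A):", prob_beats("disabled", "control"))
print("P(C > A):", prob_beats("enabled", "control"))
print("P(C > B):", prob_beats("enabled", "disabled"))
```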
I don't feel I can trust any of these results, because B is reported as beating A with very high probability, where I would expect the model to be unable to tell the two apart (at most something below 90% confidence), so I must have misunderstood something.
Any help or explanation would be greatly appreciated.