Questions tagged [ab-test]

A/B testing, also known as split or bucket testing, is a controlled comparison of the effectiveness of variants of a website, email, or other commercial product.

An A/B test, also called a split test or bucket test, is a controlled experiment in which users are randomly exposed to one of several variants of a product, often a website feature.

The Response or Dependent Variable is most often count data (such as clicks on links or sales) but may be a continuous measure (like time on site). Count data is sometimes transformed to rates for analysis.
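As a minimal sketch of that workflow, the hypothetical click and visitor counts below are converted to rates and compared with a two-proportion z-test via statsmodels (all numbers are illustrative, not from any question on this page):

```python
# Hypothetical counts: clicks (successes) and visitors (trials) per variant.
from statsmodels.stats.proportion import proportions_ztest

clicks = [480, 530]        # conversions observed in variants A and B
visitors = [10_000, 10_000]  # users exposed to each variant

rates = [c / n for c, n in zip(clicks, visitors)]
z_stat, p_value = proportions_ztest(count=clicks, nobs=visitors)
print(f"rates: A={rates[0]:.3%}, B={rates[1]:.3%}, z={z_stat:.2f}, p={p_value:.4f}")
```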

Because they create temporary variants of 'live' websites, online A/B tests must overcome several challenges not common in traditional experiments on human preference. For example, differential caching of test versions may degrade website performance for some versions. Users may be shown multiple variants if they return to a website and are not successfully identified with cookies or by login information. Moreover, nonhuman activity (search engine crawlers, email harvesters, and botnets) may be mistaken for human users.

Useful References:

Kohavi, Ron, Randal M. Henne, and Dan Sommerfield. "Practical Guide to Controlled Experiments on the Web: Listen to Your Customers not to the HiPPO." Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2007.

Kohavi, Ron, et al. "Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained." Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2012.

433 questions
14
votes
4 answers

What statistical test to use for A/B test?

We have two cohorts of 1000 samples each. We measure 2 quantities on each cohort. The first one is a binary variable. The second is a real number that follows a heavy-tailed distribution. We want to assess which cohort performs better for each metric.…
iliasfl
  • 2,554
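One common way to handle the two metrics in a setup like the question above is to test the binary metric with a chi-square test on the 2x2 count table and the heavy-tailed metric with a rank-based test that does not assume normality. A sketch with synthetic data standing in for the two cohorts (distributions and parameters are assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-ins for the two 1000-sample cohorts
conv_a = rng.binomial(1, 0.10, 1000)          # binary metric, cohort A
conv_b = rng.binomial(1, 0.12, 1000)          # binary metric, cohort B
rev_a = rng.lognormal(3.0, 1.5, 1000)         # heavy-tailed metric, cohort A
rev_b = rng.lognormal(3.1, 1.5, 1000)         # heavy-tailed metric, cohort B

# Binary metric: chi-square test on the 2x2 table of successes/failures
table = [[conv_a.sum(), len(conv_a) - conv_a.sum()],
         [conv_b.sum(), len(conv_b) - conv_b.sum()]]
chi2, p_binary, _, _ = stats.chi2_contingency(table)

# Heavy-tailed metric: Mann-Whitney U test (no normality assumption)
u_stat, p_continuous = stats.mannwhitneyu(rev_a, rev_b, alternative="two-sided")

print(f"binary metric:     p = {p_binary:.4f}")
print(f"continuous metric: p = {p_continuous:.4f}")
```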
6
votes
1 answer

What does the z-score mean in plain English (in A/B split testing)?

I'm relatively new to A/B split testing and can't wrap my brain around the idea of z-scores. I know that a z-score gives one an idea of how statistically significant a split-test result is. However, I found quite a few websites…
ckck
  • 163
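In plain terms, the z-score in a split test is "how many standard errors the observed difference in conversion rates is away from zero". A minimal hand computation with made-up counts (the numbers are illustrative):

```python
from math import sqrt
from scipy.stats import norm

# Hypothetical split-test results
conv_a, n_a = 200, 5000   # variant A: 200 conversions out of 5000 visitors
conv_b, n_b = 250, 5000   # variant B: 250 conversions out of 5000 visitors

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)              # pooled rate under H0
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))

z = (p_b - p_a) / se                                  # standard errors from zero
p_value = 2 * norm.sf(abs(z))                         # two-sided p-value

print(f"z = {z:.2f} (difference is {z:.2f} standard errors from 0), p = {p_value:.4f}")
```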
6
votes
0 answers

Is it OK to prolong a non-significant A/B test?

Background: We know from this article that ending an A/B test early due to "significant" results is a mistake. Question: But what about when a test runs for the desired time period and shows non-significant results – is it fine to prolong it? What are…
Henrik N
  • 161
6
votes
1 answer

A/B test with unequal sample size

We are measuring traffic and conversion rates on an eCommerce website. Conversion rate is defined as the % of traffic (users) that purchased something on the website out of all traffic over a period of time. For example: If the total traffic was 1M…
staccato
  • 161
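Unequal group sizes are not themselves a problem for comparing two conversion rates; each group's own sample size simply enters the standard error. A minimal sketch with hypothetical, deliberately unequal traffic splits, computing a Wald confidence interval for the difference by hand:

```python
from math import sqrt
from scipy.stats import norm

# Hypothetical unequal split: 1M users see A, 250k see B
conv_a, n_a = 30_000, 1_000_000
conv_b, n_b = 7_900, 250_000

p_a, p_b = conv_a / n_a, conv_b / n_b
diff = p_b - p_a

# Unpooled (Wald) standard error uses each group's own n
se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
z = norm.ppf(0.975)
ci = (diff - z * se, diff + z * se)

print(f"conversion A = {p_a:.3%}, B = {p_b:.3%}")
print(f"difference = {diff:.3%}, 95% CI = ({ci[0]:.3%}, {ci[1]:.3%})")
```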
2
votes
1 answer

Determining significance and variance in A/B/C/D testing

Scenario: An email campaign where four different email designs (treatments) are sent to four different populations of equal size in an attempt to find which performs better. The results returned are: Treatment Population Clicks A …
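For four treatments with click counts, a common first step is a chi-square test of independence on the treatment × (click / no click) table, followed by pairwise comparisons only if the overall test is significant. A sketch with hypothetical counts (the actual numbers in the question are truncated above):

```python
from scipy.stats import chi2_contingency

# Hypothetical clicks out of 10,000 recipients per treatment
clicks = {"A": 320, "B": 355, "C": 290, "D": 410}
n_per_group = 10_000

# Rows: treatments; columns: clicked / did not click
table = [[c, n_per_group - c] for c in clicks.values()]
chi2, p_value, dof, expected = chi2_contingency(table)

print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4f}")
```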
1
vote
0 answers

What implications does the discrepancy between the observed Minimum Detectable Effect and the initial MDE estimation have on the interpretation?

Suppose the application has a click rate of 5%. A new version may improve this. Suppose a frequentist approach is used. To estimate the sample size, the click rate of the new version is estimated at 6%. So this means a relative MDE (minimum…
rfalke
  • 111
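For the 5% → 6% setup described above, the usual frequentist calculation fixes alpha, power, and the MDE and solves for the per-group sample size. A minimal sketch with statsmodels; alpha = 0.05 and power = 0.80 are assumptions, not values taken from the question:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

p_baseline, p_target = 0.05, 0.06        # 5% -> 6%, i.e. a 20% relative MDE

effect_size = proportion_effectsize(p_target, p_baseline)   # Cohen's h
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.80,
    ratio=1.0, alternative="two-sided",
)
print(f"required sample size per group ≈ {n_per_group:.0f}")
```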
1
vote
1 answer

A/B testing: control was performing 0.5% better than the experiment set before the experiment was initiated

So we introduced a new feature in our app that would (hypothetically) aid conversion. When I tried to measure this incremental change in conversion, I split my base set of customers into control (C, 30%) and test (T, 70%) sets via random…
1
vote
0 answers

CUPED for A/B tests with a difference before the test

Can we use CUPED to remove a difference between groups that existed before the A/B test started? Is this accurate?
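CUPED adjusts the in-experiment metric using a pre-experiment covariate, removing the part of the variance (and of any pre-existing imbalance) that the covariate explains. A minimal sketch of the adjustment with synthetic data; the variable names, distributions, and the simulated lift of 1.0 are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic example: y = in-experiment metric, x = same metric pre-experiment
n = 5000
x_control = rng.gamma(2.0, 10.0, n)
x_treated = rng.gamma(2.0, 10.0, n)
y_control = 0.8 * x_control + rng.normal(0, 5, n)
y_treated = 0.8 * x_treated + rng.normal(0, 5, n) + 1.0   # true lift = 1.0

# theta is estimated on the pooled data: cov(x, y) / var(x)
x = np.concatenate([x_control, x_treated])
y = np.concatenate([y_control, y_treated])
theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

def cuped(y_arr, x_arr):
    # CUPED-adjusted metric: y - theta * (x - overall mean of x)
    return y_arr - theta * (x_arr - x.mean())

lift_raw = y_treated.mean() - y_control.mean()
lift_cuped = cuped(y_treated, x_treated).mean() - cuped(y_control, x_control).mean()
print(f"raw lift = {lift_raw:.2f}, CUPED-adjusted lift = {lift_cuped:.2f}")
```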
1
vote
0 answers

Interpretation of A/B Test with 3 Variants

We are currently having an interesting discussion about a recent A/B test we ran, and I'd very much appreciate your thoughts on it: we recently ran an A/B/C test regarding the pricing of same-day delivery. Control, current state: free same day…
Sandro
  • 11
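With three variants, one common approach is to compare each treatment against the control and adjust the resulting p-values for multiple comparisons, e.g. with the Holm procedure. A sketch with hypothetical conversion counts (the real counts are not given above):

```python
from statsmodels.stats.proportion import proportions_ztest
from statsmodels.stats.multitest import multipletests

# Hypothetical (conversions, users) per variant
control = (1200, 20_000)
variants = {"B": (1290, 20_000), "C": (1180, 20_000)}

p_values = []
for name, (conv, n) in variants.items():
    _, p = proportions_ztest([conv, control[0]], [n, control[1]])
    p_values.append(p)

reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="holm")
for (name, _), p_raw, p_adj, rej in zip(variants.items(), p_values, p_adjusted, reject):
    print(f"{name} vs control: p = {p_raw:.4f}, Holm-adjusted = {p_adj:.4f}, reject H0: {rej}")
```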
1
vote
1 answer

How would I run an A/B test if the observations are very right-skewed?

As the title states, how would I run an A/B test if the observations are very right-skewed? What could I do in order to still have a valid result? Should I remove outliers?
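Removing outliers is usually the wrong fix for skewed data; alternatives include a rank-based test, analysing a transformed metric, or a bootstrap interval for the difference in means, which keeps the metric on its original scale. A minimal percentile-bootstrap sketch with synthetic right-skewed data (distributions and sizes are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic right-skewed observations (e.g. revenue per user)
a = rng.lognormal(mean=3.0, sigma=1.2, size=4000)
b = rng.lognormal(mean=3.05, sigma=1.2, size=4000)

observed_diff = b.mean() - a.mean()

# Percentile bootstrap for the difference in means
n_boot = 10_000
diffs = np.empty(n_boot)
for i in range(n_boot):
    diffs[i] = (rng.choice(b, size=b.size, replace=True).mean()
                - rng.choice(a, size=a.size, replace=True).mean())

lo, hi = np.percentile(diffs, [2.5, 97.5])
print(f"difference in means = {observed_diff:.2f}, 95% bootstrap CI = ({lo:.2f}, {hi:.2f})")
```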
1
vote
0 answers

Good numbers, but something feels wrong

This is more a qualitative question than a quantitative one. I have done an A/B test to try to increase use of a particular feature (feature 1) by increasing discoverability. I measure it by seeing how many users of new installs of the product are…
tgdavies
  • 111
1
vote
0 answers

When should I distribute traffic evenly across all variants of an experiment?

We are currently testing our trust symbols in our shop with a GA experiment (all trust symbols vs. no trust symbols). Our last test ran for 16 days and declared a winner while it had only a 0.26% better conversion rate than the loser. I do not have a good…
Jurik
  • 111
  • 2
0
votes
1 answer

How to calculate MDE for proportions?

When conducting A/B tests, we use power analysis to calculate sample size with alpha, power, and MDE (minimum detectable effect) parameters. The MDE for a continuous variable seems intuitive: using Cohen's d to calculate the standardized mean…
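For proportions, one option is to go the other direction: fix the per-group sample size, solve for the detectable effect size (Cohen's h), and convert h back to a proportion. A sketch under assumed values (5% baseline rate, 10,000 users per group, alpha = 0.05, power = 0.80):

```python
import numpy as np
from statsmodels.stats.power import NormalIndPower

p_baseline = 0.05        # assumed baseline conversion rate
n_per_group = 10_000     # assumed fixed sample size per group

# Solve for the detectable effect size (Cohen's h) at alpha = 0.05, power = 0.80
h = NormalIndPower().solve_power(
    effect_size=None, nobs1=n_per_group, alpha=0.05, power=0.80,
    ratio=1.0, alternative="two-sided",
)

# Invert Cohen's h = 2*asin(sqrt(p2)) - 2*asin(sqrt(p1)) to recover the detectable rate
p_detectable = np.sin(np.arcsin(np.sqrt(p_baseline)) + h / 2) ** 2
mde_absolute = p_detectable - p_baseline
print(f"MDE ≈ {mde_absolute:.3%} absolute ({mde_absolute / p_baseline:.1%} relative)")
```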
0
votes
0 answers

Calculating statistical significance across differing metrics in A/B testing

First up, apologies that this is probably a dumb question. I am switching careers and just getting started in trying to understand statistical significance in A/B testing. I've quickly run into issues identifying the right statistical test. I am…
Ned Miles
  • 101
0
votes
1 answer

Can you combine the results of two A/B tests?

Suppose a feature of an app is redesigned and A/B tested in Region A with a small population, where x% of this population found the feature to be good, and it improved user engagement on the app. The same test is then performed in Region B on a…
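One standard way to combine two independent tests of the same change is a fixed-effect meta-analysis: weight each region's estimated lift by the inverse of its variance and pool. A hand-rolled sketch with hypothetical per-region counts (the region names and numbers are illustrative):

```python
from math import sqrt
from scipy.stats import norm

# Hypothetical results: (conversions_treatment, n_treatment, conversions_control, n_control)
regions = {
    "Region A": (130, 2_000, 100, 2_000),           # small region
    "Region B": (5_400, 100_000, 5_100, 100_000),   # large region
}

estimates, weights = [], []
for name, (ct, nt, cc, nc) in regions.items():
    pt, pc = ct / nt, cc / nc
    diff = pt - pc
    var = pt * (1 - pt) / nt + pc * (1 - pc) / nc   # variance of the lift estimate
    estimates.append(diff)
    weights.append(1 / var)

# Inverse-variance weighted (fixed-effect) pooled lift
pooled = sum(w * d for w, d in zip(weights, estimates)) / sum(weights)
se_pooled = sqrt(1 / sum(weights))
z = pooled / se_pooled
p_value = 2 * norm.sf(abs(z))
print(f"pooled lift = {pooled:.3%} ± {1.96 * se_pooled:.3%}, z = {z:.2f}, p = {p_value:.4f}")
```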