This is more a qualitative question than a quantitative one.

I have run an A/B test to try to increase use of a particular feature (Feature 1) by improving its discoverability.

I measure it as the number of users of Feature 1 per new install of the product, one week after installation.

The numbers look good (sample size is about 4000 for each case):

Control: 0.31 users per install

Variation: 0.40 users per install

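Before reading too much into the gap, it's worth confirming it isn't just sampling noise. Here is a minimal sketch in Python using statsmodels, assuming "users per install" is really a per-install proportion (each install either uses Feature 1 within a week or doesn't) and reconstructing approximate counts from the stated rates:

    # A minimal sketch, assuming each install either uses Feature 1 or not,
    # so "users per install" can be treated as a binomial proportion.
    # Counts are approximations reconstructed from the stated figures.
    from statsmodels.stats.proportion import proportions_ztest

    n = 4000                              # approximate installs per arm
    counts = [round(0.31 * n),            # control arm
              round(0.40 * n)]            # variation arm

    z, p = proportions_ztest(counts, [n, n])
    print(f"z = {z:.2f}, p = {p:.2g}")

At these figures the gap comes out around z ≈ 8, so sampling noise alone can't explain the difference between the arms.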
So I'm happy, but then I notice a couple of things:

  • In the 4 months before the A/B test, i.e. under the control condition, the rate was 0.37 (25k data points) -- so why 0.31 during the test? The monthly figures for the preceding months were 0.37, 0.40, 0.33 and 0.38. There may have been other changes during that period, but it struck me that all of these numbers are higher than 0.31, and the 0.38 immediately precedes the testing period. (A quick check of this gap is sketched after this list.)

  • The product also has a Feature 2, whose use I expect to be unaffected by the variation (unless I invent a story about synergy between the features) -- yet the variation increased users of that feature by a similar amount.

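Here is the corresponding sketch for the first point: the same kind of test, comparing the test-period control rate (0.31, n ≈ 4000) against the pooled four-month baseline (0.37, ~25k data points), under the same per-install-proportion assumption and with the caveat that the historical period may not be directly comparable:

    # A sketch comparing the test-period control arm against the pooled
    # historical baseline; counts are approximations from the stated figures.
    from statsmodels.stats.proportion import proportions_ztest

    counts = [round(0.31 * 4000),         # control arm during the test
              round(0.37 * 25000)]        # four preceding months, pooled
    nobs = [4000, 25000]

    z, p = proportions_ztest(counts, nobs)
    print(f"z = {z:.2f}, p = {p:.2g}")

At these figures this also comes out around z ≈ 7, i.e. the drop in the control arm is itself too large to be sampling noise. The same check could be applied to the Feature 2 rates, which are effectively acting as a guardrail metric here.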
So the numbers look good, but something feels wrong to me. Am I being too picky?

tgdavies
  • You probably need to give us more details. Were the conditions exactly the same (beyond sample size)? Is there any seasonality or general trend? – Tim Dec 20 '18 at 23:00
  • I've added some more information about fluctuations in the control rate. – tgdavies Dec 20 '18 at 23:21
  • Is it possible that there is some sort of temporal effect associated with the feature? For example, might one of the features be used more or less often during the holidays or during certain seasons? – StatsStudent Dec 20 '18 at 23:55
  • @StatsStudent that is possible, although I think it's unlikely in this case -- I certainly recognize that the historical data is not completely comparable with the data gathered during the test. – tgdavies Dec 21 '18 at 00:01
  • Another thought: Is it possible that the very nature of testing has introduced some effect that reduced use of the feature? For example, perhaps including the testing condition in the new install package has somehow made the feature more difficult (slower?) to use. You might consider a quick, back-of-the-envelope statistical test to see whether 0.31 is statistically different from the pooled mean of the previous months. – StatsStudent Dec 21 '18 at 00:21

0 Answers