This is more a qualitative question than a quantitative one.
I have done an A/B test to try to increase use of a particular feature (feature 1) by increasing its discoverability.
I measure it by counting how many users from new installs of the product are using feature 1 a week after installation.
The numbers look good (sample size is about 4000 for each case):
Control 0.31 users per install
Variation 0.40 users per install
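For what it's worth, the raw difference is comfortably statistically significant. Here is a quick sketch of a pooled two-proportion z-test (assuming each figure is a proportion of installs, i.e. each install contributes at most one user, and roughly 4000 installs per arm):

```python
# Sketch: pooled two-proportion z-test on the A/B result.
# Assumes the figures are proportions of installs, ~4000 installs per arm.
from math import sqrt
from scipy.stats import norm

def two_prop_ztest(p1, n1, p2, n2):
    """Return (z, two-sided p-value) for a pooled two-proportion z-test."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)              # pooled rate under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))  # standard error under H0
    z = (p2 - p1) / se
    return z, 2 * norm.sf(abs(z))

z, p = two_prop_ztest(0.31, 4000, 0.40, 4000)  # control vs variation
print(f"z = {z:.1f}, p = {p:.1g}")             # z ~ 8.4: far beyond sampling noise
```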
So I'm happy, but then I notice a couple of things:
In the 4 months before the A/B test, i.e. under the control condition, the rate was 0.37 (25k data points), so why 0.31 during the test? The monthly figures for the preceding months were 0.37, 0.40, 0.33 and 0.38. There may have been other changes during this period, but it struck me that all of these numbers are higher than 0.31, and the 0.38 immediately precedes the testing period.
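Reusing the two_prop_ztest sketch from above on this comparison (and assuming the pre-test data points are comparable to the in-test ones), the drop in the control rate looks too large to be sampling noise, which is part of what worries me:

```python
z, p = two_prop_ztest(0.37, 25000, 0.31, 4000)  # pre-test baseline vs in-test control
print(f"z = {z:.1f}, p = {p:.1g}")              # |z| ~ 7.3: the drop itself looks real
```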
The product also has a feature 2, and I expect use of it to be unaffected by the variation (unless I make up a story about synergy between the features), yet the variation increased users of that feature by a similar amount.
So the numbers look good, but something feels wrong to me. Am I being too picky?