How can I determine statistical significance in an A/B test in which the KPI is dependent upon two variables - one Bernoulli and one continuous?

Question

In my work in online marketing, we frequently run A/B or multivariate web page tests. The Key Performance Indicator for these tests is overall Revenue. The treatment that nets the most revenue, either by influencing more conversions to sale, or by influencing a user to spend more per transaction, or a combination of these two, is the winner.

However, how can we determine whether we have a statistically valid Revenue impact in a particular test? The standard-issue significance test is not useful here, since the Revenue impact is due to a combination of the probability to convert and the dollar value of each conversion.

We have played with a version of Overall Evaluation Criterion from the Taguchi playbook, but honestly I'm not statistically sophisticated enough to know if this approach is valid.

I appreciate any suggestions.

This seems like a natural candidate for a hierarchical model, but without someone on staff who knows Bayesian methods and programming, I'm not sure how helpful that will be. If you think it would be helpful, I can write up a more detailed answer later this evening. — Sean Easter, Apr 16 '14 at 14:56
Thank you, Sean, we do have programmers on staff and external relationships who can help interpret & implement this type of solution. I would be very interested in hearing more detail about the model you would suggest. — Brian, Apr 16 '14 at 15:02

dimitriy · Accepted Answer · 2014-04-18T16:28:11.153

This is the basic setup for a two-part model (TPM). It is useful when your outcome looks like the Loch Ness monster:

[Figure: histogram of the outcome, with a large spike at zero (the head) and a hump of positive values]

Lots of customers do not buy anything, so the substantial mass point at zero makes a single index model problematic. TPM relaxes the assumption that the excess zeros and the positives come from the same data-generating process (DGP).

Given a non-negative outcome $y,$ and an exogenous treatment assignment variable $t \in \{T,C\}$, we have \begin{equation} \overbrace{\mathbb{E}(y \mid x,t)}^\text{# of Purchases or Revenue} = \underbrace{\Pr(y>0 \mid x,t)}_\text{Buy or Not? (EM)} \cdot \overbrace{\mathbb{E}(y \mid y>0, x,t)}^\text{Then How Much? (IM)} \end{equation}
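As a quick numerical check of the decomposition above, here is a minimal Python sketch on made-up data. The 7.5% purchase rate and the lognormal spend are assumptions chosen only for illustration, and there are no covariates, so the identity holds exactly in the sample.

```python
import random

random.seed(0)

# Hypothetical toy data: 7.5% of visitors buy, and buyers spend a lognormal amount.
revenue = [
    random.lognormvariate(2.0, 0.25) if random.random() < 0.075 else 0.0
    for _ in range(10_000)
]

positives = [y for y in revenue if y > 0]

p_buy = len(positives) / len(revenue)          # extensive margin: Pr(y > 0)
mean_if_buy = sum(positives) / len(positives)  # intensive margin: E(y | y > 0)
mean_overall = sum(revenue) / len(revenue)     # E(y), revenue per visitor

# The product of the two margins reproduces revenue per visitor.
print(p_buy * mean_if_buy, mean_overall)
```

The two printed numbers agree up to floating-point error, which is all the equation says once covariates are dropped.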

Treatment can alter both the extensive and intensive margins. It can move customers from the head to the hump (EM), and it can move the hump further to the right (IM). The EM is modeled using a probit or logit model on the full sample. A linear probability model (LPM) is another alternative. The IM uses robust Poisson for number of purchases, a GLM with log link and inverse Gaussian or Gamma family for revenue, or even vanilla OLS with logged revenue, for those who actually bought something. In principle, you could have different covariates $x$ for each part.
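For a plain A/B test with no covariates, each margin can be checked separately with very simple tools. The sketch below is a simplification of the models named above, not a replacement for them: the EM test is a two-proportion z-test (which is what an LPM with only a treatment dummy estimates), and the IM test compares mean log revenue among buyers using a normal approximation. The function names and the toy data are my own.

```python
import math
import random
from statistics import NormalDist, mean, variance


def two_sided_p(z):
    """Two-sided p-value from a standard normal reference distribution."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


def extensive_margin_test(control, treatment):
    """Two-proportion z-test on Pr(y > 0), using the pooled standard error."""
    n_c, n_t = len(control), len(treatment)
    b_c = sum(y > 0 for y in control)
    b_t = sum(y > 0 for y in treatment)
    pooled = (b_c + b_t) / (n_c + n_t)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    diff = b_t / n_t - b_c / n_c
    z = diff / se
    return diff, z, two_sided_p(z)


def intensive_margin_test(control, treatment):
    """Unequal-variance z-test on mean log revenue, buyers only."""
    log_c = [math.log(y) for y in control if y > 0]
    log_t = [math.log(y) for y in treatment if y > 0]
    diff = mean(log_t) - mean(log_c)
    se = math.sqrt(variance(log_c) / len(log_c) + variance(log_t) / len(log_t))
    z = diff / se
    return diff, z, two_sided_p(z)


# Hypothetical data: treatment raises the purchase rate but not the spend.
rng = random.Random(1)


def simulate_arm(p_buy, mu, n):
    return [rng.lognormvariate(mu, 0.25) if rng.random() < p_buy else 0.0
            for _ in range(n)]


control = simulate_arm(0.05, 2.0, 5000)
treatment = simulate_arm(0.08, 2.0, 5000)

em_diff, em_z, em_p = extensive_margin_test(control, treatment)
im_diff, im_z, im_p = intensive_margin_test(control, treatment)
print("EM:", em_diff, em_z, em_p)
print("IM:", im_diff, im_z, im_p)
```

Two separate tests do not tell you whether overall revenue moved, which is why the combined effect is the quantity to test in the end.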

You can look at the marginal effects at each margin, but you can also combine the estimates into overall average marginal effects: \begin{align} AME_{Levels} &=\frac{1}{N} \sum_{i=1}^N \left( \widehat{\mathbb{E}(y_{i} \mid t_{i} = T)}-\widehat{\mathbb{E}(y_{i} \mid t_{i} = C)} \right) \\ AME_{\%} &= \frac{1}{N} \sum_{i=1}^N \left( \frac{\widehat{\mathbb{E}(y_{i} \mid t_{i} = T)}-\widehat{\mathbb{E}(y_{i} \mid t_{i} = C)}}{\widehat{\mathbb{E}(y_{i} \mid t_{i} = C)}} \right) \end{align} These AMEs combine both intensive and extensive margin effects. I dropped the covariates from the equations above to reduce notational clutter, but they are implied.
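To make the two AME formulas concrete, here is a small Python sketch. It assumes you already have, for every visitor, a predicted purchase probability and a predicted spend-if-buying under each arm; the three visitors and their numbers below are invented for illustration.

```python
def average_marginal_effects(pred_treat, pred_control):
    """Return (AME in levels, AME in %) from per-visitor predictions of E(y | t)."""
    n = len(pred_treat)
    diffs = [t - c for t, c in zip(pred_treat, pred_control)]
    ame_levels = sum(diffs) / n
    ame_pct = sum(d / c for d, c in zip(diffs, pred_control)) / n
    return ame_levels, ame_pct


# Hypothetical predictions for three visitors under each arm.
p_treat, spend_treat = [0.10, 0.08, 0.12], [22.0, 20.0, 25.0]
p_ctrl, spend_ctrl = [0.08, 0.08, 0.10], [20.0, 20.0, 24.0]

# Combined prediction per visitor: Pr(buy) * E(spend | buy).
pred_treat = [p * m for p, m in zip(p_treat, spend_treat)]
pred_control = [p * m for p, m in zip(p_ctrl, spend_ctrl)]

ame_levels, ame_pct = average_marginal_effects(pred_treat, pred_control)
print(ame_levels, ame_pct)
```

Note that the percentage version averages the per-visitor ratios, so it is not the same as dividing the levels AME by the average control prediction unless the predictions are identical across visitors.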

In Stata, there's a user-written command tpm by Federico Belotti, Partha Deb, Willard Manning and Edward Norton that will handle the hypothesis test on customer-level cross-sectional data, including covariates:

tpm revenue i.treat x y z, first(probit) second(glm, family(igaussian) link(log)) robust
margins, dydx(treat)
margins, eydx(treat)

The output of margins might look something like this:

------------------------------------------------------------------------------
             |            Delta-method
             |      ey/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       treat |
  Treatment  |   .1411529   .0583387     2.42   0.016     .0268112    .2554945
------------------------------------------------------------------------------

That corresponds to a 14% lift with a standard error of 6%.
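If you want to see where the z statistic, p-value and interval in that table come from, the arithmetic is just a normal approximation around the estimate. A short Python check, using only the two numbers from the output above:

```python
from statistics import NormalDist

estimate = 0.1411529  # ey/dx from the margins output above
std_err = 0.0583387   # delta-method standard error from the same row

z = estimate / std_err
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# 95% interval: estimate plus or minus the 97.5th normal percentile times the SE.
crit = NormalDist().inv_cdf(0.975)
ci_low, ci_high = estimate - crit * std_err, estimate + crit * std_err

print(round(z, 2), round(p_value, 3), ci_low, ci_high)
```

The interval excludes zero, which is the same statement as the p-value being below 0.05.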

Here's an attempt to hack something like the tpm marginal effects with some medical expenditure data. We will look at the two MEs of having health insurance, one in dollars and one as an elasticity (in %). The expected effect is negative. The first step defines a program that estimates the probit and GLM separately, and then calculates the marginal effect by hand by setting the insurance variable to 0 for everyone and then to 1 for everyone, which gives the marginal effects for each observation. I think you should be able to do something like this in R or Python:

set more off

use "http://cameron.econ.ucdavis.edu/musbook/mus16data.dta", clear

capture program drop mybs
program define mybs, rclass
    set more off
    preserve
        gen insurance = ins

        probit ambexp age i.female educ i.blhisp totchr i.ins, nolog
        replace ins = 0
        predict double phat0
        replace ins = 1
        predict double phat1

        replace ins = insurance
        glm ambexp age i.female educ i.blhisp totchr i.ins if ambexp>0, family(gamma) link(log) nolog
        replace ins = 0
        predict double yhat0
        replace ins = 1
        predict double yhat1

        gen double delta = phat1*yhat1 - phat0*yhat0
        sum delta, meanonly
        return scalar delta = r(mean)

        gen double lift = (phat1*yhat1 - phat0*yhat0)/(phat0*yhat0)
        sum lift, meanonly
        return scalar  lift = r(mean)
    restore
end

set seed 123456
bootstrap delta = r(delta) lift = r(lift), reps(1000) nodots saving(bs, replace): mybs

tpm ambexp age i.female educ i.blhisp totchr i.ins, first(probit, nolog) second(glm, family(gamma) link(log) nolog)
margins, dydx(ins)
margins, eydx(ins)

The bootstrapped results are:

. bootstrap delta = r(delta) lift = r(lift), reps(100) nodots saving(bs, replace): mybs

Warning:  Because mybs is not an estimation command or does not set e(sample), bootstrap has no way to determine which observations are used in calculating the statistics and so assumes that all observations are used.  This means that no observations
          will be excluded from the resampling because of missing values or other reasons.

          If the assumption is not true, press Break, save the data, and drop the observations that are to be excluded.  Be sure that the dataset in memory contains only the relevant data.

Bootstrap results                               Number of obs      =      3328
                                                Replications       =       100

      command:  mybs
        delta:  r(delta)
         lift:  r(lift)

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       delta |   -179.972   93.18576    -1.93   0.053    -362.6127    2.668751
        lift |   -.103458   .0611306    -1.69   0.091    -.2232717    .0163557
------------------------------------------------------------------------------

These are the marginal effects from tpm on the original data:

. margins, dydx(ins)
Warning: cannot perform check for estimable functions.

Average marginal effects                          Number of obs   =       3328

Expression   : tpm combined expected values, predict()
dy/dx w.r.t. : 1.ins

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       1.ins |   -179.972   89.62025    -2.01   0.045    -355.6244   -4.319519
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.

. margins, eydx(ins)
Warning: cannot perform check for estimable functions.

Average marginal effects                          Number of obs   =       3328

Expression   : tpm combined expected values, predict()
ey/dx w.r.t. : 1.ins

------------------------------------------------------------------------------
             |            Delta-method
             |      ey/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       1.ins |  -.1099748   .0673351    -1.63   0.102    -.2419491    .0219995
------------------------------------------------------------------------------
Note: ey/dx for factor levels is the discrete change from the base level.
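For anyone who wants the bootstrap idea outside Stata, here is a rough Python sketch for the simplest case: a randomized A/B test with no covariates. It is not a port of tpm. The names delta and lift mirror the Stata program above, everything else (function names, toy data, percentile intervals instead of normal-based ones) is my own assumption. Without covariates the two-part product equals the plain sample mean, so this amounts to bootstrapping the difference in revenue per visitor; the two-part structure only starts to matter once each part gets its own regression.

```python
import random


def revenue_per_visitor(sample):
    """Two-part estimate: Pr(y > 0) times mean(y | y > 0)."""
    buyers = [y for y in sample if y > 0]
    if not buyers:
        return 0.0
    return (len(buyers) / len(sample)) * (sum(buyers) / len(buyers))


def percentile_interval(draws, level=0.95):
    """Simple percentile interval from a list of bootstrap draws."""
    s = sorted(draws)
    tail = (1 - level) / 2
    return s[int(tail * len(s))], s[int((1 - tail) * len(s)) - 1]


def bootstrap_delta_lift(control, treatment, reps=500, seed=123456):
    """Resample visitors within each arm and recompute delta and lift."""
    rng = random.Random(seed)
    deltas, lifts = [], []
    for _ in range(reps):
        c = revenue_per_visitor(rng.choices(control, k=len(control)))
        t = revenue_per_visitor(rng.choices(treatment, k=len(treatment)))
        if c == 0:
            continue  # lift is undefined when the resampled control has no sales
        deltas.append(t - c)
        lifts.append((t - c) / c)
    return percentile_interval(deltas), percentile_interval(lifts)


# Hypothetical data: 1000 visitors per arm, zeros for non-buyers.
data_rng = random.Random(7)


def simulate_arm(p_buy, mu, n):
    return [data_rng.lognormvariate(mu, 0.25) if data_rng.random() < p_buy else 0.0
            for _ in range(n)]


control = simulate_arm(0.05, 2.0, 1000)
treatment = simulate_arm(0.10, 2.0, 1000)

delta_ci, lift_ci = bootstrap_delta_lift(control, treatment)
print("delta 95% interval:", delta_ci)
print("lift  95% interval:", lift_ci)
```

If the delta interval excludes zero, that is the bootstrap analogue of a significant dydx in the margins output. With covariates you would refit both parts inside the loop, as the Stata program does.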

Dimitriy, this is fantastic. I am still digesting, but quick question: do you know of a sister command / package to tpm in R or Python? I am more familiar with those languages...and if not nbd. — Brian, Apr 17 '14 at 02:41
@Brian I am not aware of any non-Stata implementations. Hacking together the marginal effect is pretty easy, but getting the standard error is non-trivial. At first, I thought that it would be easy enough to bootstrap, but that is not working for some reason. I have added code. — dimitriy, Apr 18 '14 at 00:05
@Brian I fixed the bootstrap after sleeping on it. Feel free to ask questions if the Stata code does not make sense. — dimitriy, Apr 18 '14 at 16:31

score 3 · Answer 2 · answered Apr 18 '14 at 16:26

You might consider a simple Bayesian model of the two treatments: Let whether each visitor makes a purchase under each treatment be Bernoulli random variables with respective parameters, and let purchase amounts be lognormal random variables with respective log-scale parameters and equal shape parameters. (Lognormal is used here for illustration, taking for granted that it's a reasonable fit.)

Having done that, we can build a posterior predictive model for the average sale of each treatment. (See explanation in code comments.)

# Written against the PyMC 2.x API (pm.MCMC, pm.rbernoulli, @pm.deterministic)
import numpy as np
import pymc as pm
from matplotlib import pyplot as plt

# Priors for probability of a purchase, and for the log-scale parameters
# The beta distribution is conjugate to the Bernoulli and binomial, so it's a commonly used prior
thetaA = pm.Beta("pA", .5, .5)
thetaB = pm.Beta("pB", .5, .5)
# pm.Normal takes (mu, tau) where tau is the precision, so 10e-6 gives a very diffuse prior
muA = pm.Normal("muA", 0, 10e-6)
muB = pm.Normal("muB", 0, 10e-6)

# Now let's make some toy data
# Note that treatment b has a lower probability of purchase, but a higher average purchase
# These are the true purchase probabilities
observedPurchaseA = pm.rbernoulli(0.075, 1000)
observedPurchaseB = pm.rbernoulli(0.03, 1000)
obsPurchaseAmountsA = np.multiply(observedPurchaseA, np.random.lognormal(2, 0.25, 1000)) 
obsPurchaseAmountsB = np.multiply(observedPurchaseB, np.random.lognormal(2.1, 0.25, 1000))

obsPurTrA = pm.Bernoulli("purchasesA", thetaA, value=observedPurchaseA, observed = True)
obsPurTrB = pm.Bernoulli("purchasesB", thetaB, value=observedPurchaseB, observed = True)
# pm.Lognormal takes (mu, tau) with tau = 1/sigma^2, so sigma = 0.25 corresponds to tau = 1/0.25**2
obsAmtTrA = pm.Lognormal("amtA", muA, 1/0.25**2, value = obsPurchaseAmountsA[obsPurchaseAmountsA > 0], observed = True)
obsAmtTrB = pm.Lognormal("amtB", muB, 1/0.25**2, value = obsPurchaseAmountsB[obsPurchaseAmountsB > 0], observed = True)

# Since we observed more purchases in one treatment than the other, we have to contend with unequal sample sizes for purchase amounts
# That's not a problem in Bayesian analysis, but more data generally means a narrower posterior distribution 
# In other words, we'll be surer of the parameter for which we've observed more purchases

# What we're really interested in is the posterior predictive distribution: The distribution of future purchases based on the data we've seen
# Rather than compute that directly, here I'm using simulation to compute the average sale per visitor at each step in the MCMC chain
@pm.deterministic
def expectedA(mu = muA, theta = thetaA):
    return np.mean(np.multiply(pm.rbernoulli(theta, 1000), np.random.lognormal(mu, 0.25, 1000)))

@pm.deterministic
def expectedB(mu = muB, theta = thetaB):
    return np.mean(np.multiply(pm.rbernoulli(theta, 1000), np.random.lognormal(mu, 0.25, 1000)))

model = [thetaA, thetaB, muA, muB, observedPurchaseA, observedPurchaseB, obsPurchaseAmountsA, obsPurchaseAmountsB, obsPurTrA, obsPurTrB, obsAmtTrA, obsAmtTrB, expectedA, expectedB]
mcmc = pm.MCMC(model)
# 30000 iterations, discarding the first 10000 as burn-in
mcmc.sample(30000,10000)

# Now let's compare the distribution of average sales
ax = plt.subplot(211)
plt.xlim(0, np.max(np.concatenate((mcmc.trace('expectedA')[:],mcmc.trace('expectedB')[:]))))
plt.hist(mcmc.trace('expectedA')[:], bins = 25)
ax = plt.subplot(212)
plt.xlim(0, np.max(np.concatenate((mcmc.trace('expectedA')[:],mcmc.trace('expectedB')[:]))))
plt.hist(mcmc.trace('expectedB')[:], bins = 25)

Now a look at those posterior plots:

[Figure: Posterior plots of average sales]

The first treatment clearly outperforms the second. The formal decision rule for this could rely on comparing the credible intervals for the two treatments. Another option is to compute the distribution of the difference between treatments.

Using equal shape parameters is also a simplifying assumption, but programming separate ones into the model is straightforward. Again, the data here is just a toy example; in practice you'd want to use a model with a well-evaluated fit.
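The option of looking at the distribution of the difference between treatments can be sketched without MCMC. The Python below is a stand-in for the PyMC model, not the same code: it draws directly from conjugate posteriors, assuming a Beta(0.5, 0.5) prior on each purchase probability, a flat prior on each log-scale parameter, and a known shape of 0.25. The expected sale per visitor is then theta * exp(mu + sigma^2/2), the purchase probability times the lognormal mean. Function names and toy data are my own.

```python
import math
import random

rng = random.Random(42)
SIGMA = 0.25  # assumed known shape (log-scale standard deviation), as in the toy data


def simulate_arm(p_buy, mu, n):
    """Toy data: zero for non-buyers, lognormal spend for buyers."""
    return [rng.lognormvariate(mu, SIGMA) if rng.random() < p_buy else 0.0
            for _ in range(n)]


def posterior_expected_sale(data, draws=5000):
    """Posterior draws of expected sale per visitor for one treatment."""
    log_amounts = [math.log(y) for y in data if y > 0]
    k, n = len(log_amounts), len(data)
    log_mean = sum(log_amounts) / k
    out = []
    for _ in range(draws):
        theta = rng.betavariate(0.5 + k, 0.5 + n - k)       # Beta posterior
        mu = rng.gauss(log_mean, SIGMA / math.sqrt(k))      # Normal posterior, flat prior
        out.append(theta * math.exp(mu + SIGMA ** 2 / 2))   # Pr(buy) * lognormal mean
    return out


arm_a = simulate_arm(0.075, 2.0, 1000)
arm_b = simulate_arm(0.03, 2.1, 1000)

post_a = posterior_expected_sale(arm_a)
post_b = posterior_expected_sale(arm_b)

# Posterior of the difference, its 95% credible interval, and Pr(A beats B).
diff = sorted(a - b for a, b in zip(post_a, post_b))
interval = (diff[int(0.025 * len(diff))], diff[int(0.975 * len(diff)) - 1])
prob_a_better = sum(d > 0 for d in diff) / len(diff)

print("95% credible interval for A - B:", interval)
print("Pr(A > B):", prob_a_better)
```

A decision rule could then be something like "ship A if Pr(A > B) exceeds a threshold you agreed on beforehand"; the threshold itself is a business choice, not something the model gives you.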

Thank you Sean, I've requested the help of a far more informed friend to review both your and Dimitriy's responses for suitability to our problem. — Brian, Apr 18 '14 at 23:51
