I have designed and run an online experiment in which we've slightly changed parts of a web page. Let's say users visit our website to place food orders and the order funnel looks like this: home --> menu --> basket --> checkout --> order. In this experiment we're making changes to the basket.
The success metric of the experiment is the basket-to-checkout click-through rate (CTR): out of all basket page views, what fraction makes it to the checkout page? I've chosen this metric because it's the closest one to our intervention, so it should be sensitive enough for us to detect a meaningful change. Note that this is a pageview-level metric: each time a user views the basket, they may or may not make it to the checkout page.
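In other words:

$$\text{CTR} = \frac{\text{number of basket views that reach the checkout page}}{\text{total number of basket views}}$$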
The experiment is randomised at the user level. This means that a user who is assigned to the treatment condition when they enter the experiment (i.e. the first time they view the basket) will remain in the treatment condition for the duration of the experiment. Thus, if the same user comes back and views the basket a few days later, they will experience the same version of the basket they did the first time. The same applies for users who are assigned to the control condition.
So far so good. The problem I have is that the unit of randomisation (users) is not the same as the unit of analysis (page views). In their book *Trustworthy Online Controlled Experiments* (2020), Kohavi et al. state:
> We now look at CTR and discuss two common ways to compute it, each with different analysis units. The first is to count the clicks and divide by the number of page views; the second is to average each user's CTR and then average all the CTRs. If randomisation is done by user, then the first mechanism uses a different analysis unit than the randomisation unit, which violates the independence assumption and makes the variance computation more complex.
I have a dataset which has one row per (basket) page view, a user identifier, the experimental condition, and an indicator showing whether they made it to the checkout page. Note that users could appear in the dataset multiple times if they viewed the basket page multiple times during the experiment. Following Kohavi et al.'s advice, I have used the second method to calculate CTR:
- I first group my data at the user level
- Then I calculate each user's CTR (by taking the mean of the indicator column)
- Finally, I group by the treatment condition and take the average of all the user-level CTRs. I may then want to compare each condition's CTR and check whether they differ significantly (both computations are sketched in the code below).
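To make the two computations concrete, here's a minimal sketch in R; the data frame and the column names `user_id`, `condition`, and `clicked` are made up for illustration and stand in for my real pageview-level dataset:

```r
library(dplyr)

# Toy stand-in for the real pageview-level dataset: one row per basket view.
pageviews <- tibble(
  user_id   = c(1, 1, 1, 2, 3, 3, 4),
  condition = c("treatment", "treatment", "treatment",
                "control", "control", "control", "treatment"),
  clicked   = c(1, 0, 1, 1, 0, 1, 0)  # did this basket view reach checkout?
)

# Method 1 (pageview-level): total clicks / total basket views per condition.
pageviews |>
  group_by(condition) |>
  summarise(ctr = mean(clicked))

# Method 2 (user-level): compute each user's CTR, then average per condition.
user_level <- pageviews |>
  group_by(user_id, condition) |>
  summarise(user_ctr = mean(clicked), n_views = n(), .groups = "drop")

user_level |>
  group_by(condition) |>
  summarise(ctr = mean(user_ctr))
```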
Now, say that I stop at step 2 and have a user-level dataset consisting of users, their experimental condition, and their average CTR. I want to fit a model to this data to test whether there's a significant difference in CTR between the experimental conditions.
If I ignored the fact that the pageview-level data are not independent (because of the within-user correlation introduced by users with multiple page views), I could model the original data with a logistic regression whose response variable is the boolean indicating whether the (basket) page view made it to the checkout. Instead, I now have a user-level dataset whose rows are independent of one another. However, the response variable is no longer a boolean but a proportion. Most users have a response value of exactly 1 or 0 because they visited just once and either made it to the checkout page or not, but I also have lots of users whose response is a proportion strictly between 0 and 1 (e.g. 0.84).
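For reference, that naive model would be a one-liner on the toy `pageviews` frame from above:

```r
# Naive logistic regression on the pageview-level data: treats every basket
# view as independent, which understates the variance when users contribute
# multiple correlated views.
naive_fit <- glm(clicked ~ condition, family = binomial, data = pageviews)
summary(naive_fit)
```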
Here's what I've ruled out or tried so far (see the sketch after this list):

- I can't use a simple binomial GLM on the user-level data because the response variable isn't in $\{0,1\}$ but in $[0,1]$.
- I don't think I can use a binomial GLM with the proportions as the response either, because I would have to pass each user's number of page views as the weights, which would be equivalent to fitting the pageview-level model above.
- I tried fitting a binomial GLMM to the pageview-level (non-independent) data, with the user identifier as a random effect, but it was too computationally expensive.
- My next thought was beta regression, since it's normally used to model probabilities/proportions. However, beta regression is only defined on the open interval $(0,1)$, and most users in my dataset have a value of exactly 0 or exactly 1. One option would be to squeeze the response into $(0,1)$ by adding a small constant to the zeros and subtracting one from the ones, but that feels hacky.
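For concreteness, here's roughly what those attempts look like in R, again on the toy frames from above (a sketch, not my actual code; the boundary squeeze in (c) uses the Smithson & Verkuilen (2006) transformation as one concrete choice):

```r
library(lme4)     # for glmer()
library(betareg)  # for betareg()

# (a) Weighted binomial GLM on the user-level proportions: equivalent to
# the pageview-level model, so it shares its independence problem.
prop_fit <- glm(user_ctr ~ condition, family = binomial,
                weights = n_views, data = user_level)

# (b) Binomial GLMM on the pageview data with a per-user random intercept;
# this is the model that was too computationally expensive for me.
glmm_fit <- glmer(clicked ~ condition + (1 | user_id),
                  family = binomial, data = pageviews)

# (c) Beta regression after squeezing {0, 1} into (0, 1) with the
# Smithson & Verkuilen transformation (y * (n - 1) + 0.5) / n;
# this is the "hacky" boundary adjustment I mentioned.
n <- nrow(user_level)
user_level$ctr_squeezed <- (user_level$user_ctr * (n - 1) + 0.5) / n
beta_fit <- betareg(ctr_squeezed ~ condition, data = user_level)
```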
What is the right way to model this data?
Update (from the comments): fitting the GLMM on the full data failed with `Error: vector memory exhausted (limit reached?)`. I got around this by sampling a smaller subset of users before fitting the model. The problem is that the fitted values from the beta regression are too low: for example, the observed CTR for the treatment condition is 92%, but betareg gives an estimate of 80%. In contrast, the simple logistic regression model gives estimates very close to the observed values. I've also tried fitting with `weights` and the results seem sensible (certainly more sensible than betareg's). Do you know what goes on when fitting such a model?