I have a problem where my dependent variable is a click-through rate and is therefore bounded in [0, 1]. I have the traffic for each sample (a combination of design factors) and could reconstruct a dataset appropriate for logistic regression, but is there a proper way to avoid doing this? From what I've seen, it sounds like quasi-binomial or beta regression would work.

I'd prefer to do this in R, but the project requires Python, which luckily has a lot of equivalents in the statsmodels package. I thought that the standard GLM with a Binomial family and logit link would not accept a continuous DV, but the model seems to fit without complaint when given the traffic as the additional freq_weights argument. Is the code implicitly fitting a quasi-binomial model in the background?


2 Answers

1

If you have the number of people who saw the button (or who had the potential to click through) and your outcome is the number of people who actually clicked, then you can do a Poisson regression with an offset.

Poisson regression assumes the log of the expectation of $y$ can be expressed as

$$ \log(E(y)) = X\beta + \log(N) $$

Here, $\log(N)$ is an offset term. Some algebra can show this is equivalent to modelling $\log(E(y)/N)$, and since $y$ is a count then $y/N$ is a rate and $E(y)/N$ is the expected rate.
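Spelling out that algebra, using only the symbols already defined above:

$$
\begin{aligned}
\log(E(y)) &= X\beta + \log(N) \\
\log(E(y)) - \log(N) &= X\beta \\
\log\left(\frac{E(y)}{N}\right) &= X\beta
\end{aligned}
$$

so $E(y)/N = \exp(X\beta)$: the coefficients describe the expected rate on the log scale, and the offset enters with its coefficient fixed at 1.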

This is very straightforward to do in statsmodels. Just pass the log of the traffic to the offset argument:

    import numpy as np
    import pandas as pd
    import patsy
    from statsmodels.discrete.discrete_model import Poisson

    np.random.seed(0)

    # Create data: every combination of the three design factors
    color = pd.DataFrame({'color': ['Blue', 'Green']})
    shape = pd.DataFrame({'shape': ['Round', 'Square']})
    size = pd.DataFrame({'size': ['Regular', 'Small']})
    df = color.merge(shape, how='cross').merge(size, how='cross')
    df['traffic'] = np.random.randint(low=1000, high=10_000, size=len(df))

    # Design matrix with all main effects and interactions
    X = np.asarray(patsy.dmatrix('~color*size*shape', data=df))
    beta = np.random.normal(0, 0.05, size=X.shape[1])
    beta[0] = 0.2  # intercept
    lam = np.exp(X @ beta + np.log(df.traffic))
    df['y'] = np.random.poisson(lam)

    # Model it
    model = Poisson(df.y, X, offset=np.log(df.traffic)).fit()
    print(model.summary())


You can verify that the estimates are close to their real values.
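One way to make that check explicit is to put the true and estimated coefficients side by side. The sketch below is self-contained and uses its own simulated data, not the data frame above: the 0/1 dummy coding, the main-effects-only design, the traffic range, and the values in `true_beta` are all assumptions chosen for illustration.

```python
import itertools

import numpy as np
import pandas as pd
from statsmodels.discrete.discrete_model import Poisson

rng = np.random.default_rng(1)

# Hypothetical 2x2x2 design coded as 0/1 dummies, main effects only.
cells = np.array(list(itertools.product([0, 1], repeat=3)), dtype=float)
X = np.column_stack([np.ones(len(cells)), cells])
names = ["Intercept", "color", "shape", "size"]

# Assumed true coefficients on the log-rate scale (illustrative values).
true_beta = np.array([-3.0, 0.10, -0.05, 0.08])

traffic = rng.integers(20_000, 100_000, size=len(X))
y = rng.poisson(np.exp(X @ true_beta) * traffic)

# Same pattern as above: log(traffic) goes in as the offset.
res = Poisson(y, X, offset=np.log(traffic)).fit(disp=0)

ci = res.conf_int()  # array with one (lower, upper) row per coefficient
table = pd.DataFrame(
    {
        "true": true_beta,
        "estimate": res.params,
        "ci_low": ci[:, 0],
        "ci_high": ci[:, 1],
    },
    index=names,
)
print(table.round(3))
```

With traffic in the tens of thousands per cell, the estimates should land close to the true values, with the usual caveat that any single interval can miss by chance.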

0

Hmm, if your variable is bounded in [0, 1] and represents a rate of some sort, or a count (which, given time, can then become a rate), it might be more useful to use a GLM with a Poisson family (log link) and include an offset term.

For more, see: When to use an offset in a Poisson regression?