
I am trying to simulate the calculation of the Average Marginal Effect (AME) of a binary variable in a basic linear regression with an interaction term, and to compare the empirical standard deviation of the AME across simulations with the analytic standard error I calculate using the Delta Method. But I keep getting an analytic standard error that is way too small.

Below is my attempt, with a detailed explanation of each step:

  1. Data generation code:

I am simulating data of the form:

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_1 x_2 + \epsilon$

# import packages
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def make_data():
    # preset beta values
    beta0 = 1
    beta1 = 2
    beta2 = 3
    betas = np.asarray([beta0, beta1, beta2])

    n = 500
    x1 = np.random.choice(2, size=[n, 1])  # x1 is binary!
    x2 = np.random.normal(1, 1, size=[n, 1])
    x1x2 = x1*x2
    X = np.hstack([x1, x2, x1x2])
    epsilon = np.random.normal(0, 0.1, size=[n, 1])
    y = beta0 + x1*beta1 + x1x2*beta2 + epsilon

    data = np.hstack([y, X])
    df = pd.DataFrame(data=data, columns=['y', 'x1', 'x2', 'x1x2'])

    return df, x1, x2, x1x2

df, x1, x2, x1x2 = make_data()

  2. Calculate the AME (average marginal effect)

I calculate the AME by the difference formula for binary variables, taken from the official STATA documentation here:

$\frac{1}{n} \sum_{i=1}^n (\hat{\beta}_0 + \hat{\beta}_1 (1) + \hat{\beta}_2 (1) x_2) - (\hat{\beta}_0 + \hat{\beta}_1 (0) + \hat{\beta}_2 (0) x_2) $

def compute_ame(df, x1, x2, x1x2):
    formula = "y ~ x1 + x1x2"
    model = smf.ols(formula, data=df).fit()

    return np.mean( model.predict(df.assign(**{"x1":1, "x1x2":x2})) - model.predict(df.assign(**{"x1":0, "x1x2":0})) )

compute_ame(df, x1, x2, x1x2)

The output value is around 5, which makes sense since $\beta_1 = 2, \beta_2 = 3$, and $x_2 $ has a mean value of $1$. So from the equation, $y = \beta_0 + \beta_1 x_1 + \beta_2 x_1 x_2 + \epsilon$, turning $x_1$ on increases $y$ by 5.
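Writing the difference out, the AME reduces to

$\widehat{AME} = \hat{\beta}_1 + \hat{\beta}_2 \, \bar{x}_2 \approx 2 + 3 \cdot 1 = 5$

where $\bar{x}_2$ is the sample mean of $x_2$.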

  3. Get the covariance matrix of the model

formula = "y ~ x1 + x1x2"
model = smf.ols(formula, data=df).fit()

vcov = model.cov_params().values
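# rows/columns of vcov follow the fitted parameter order: Intercept, x1, x1x2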

  4. Get the gradient of the AME with respect to each $\beta$

$\frac{\partial}{\partial \beta} \frac{1}{n} \sum_{i=1}^n (\hat{\beta}_0 + \hat{\beta}_1 (1) + \hat{\beta}_2 (1) x_2) - (\hat{\beta}_0 + \hat{\beta}_1 (0) + \hat{\beta}_2 (0) x_2) $

$= \big[ (1-1) \enspace\enspace 1 \enspace\enspace \frac{1}{n} \sum_{i=1}^n x_2 \big]$

We have a single row vector here because we are computing the AME for only one variable, namely the binary variable $x_1$.

grad = np.asarray([0, 1, np.mean(x2)]).reshape([-1, 1])

  5. Calculate the standard error:

$(grad)^T (vcov) (grad)$

result = grad.T @ vcov @ grad
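
For this three-parameter model the quadratic form expands to

$grad^T \, vcov \, grad = \operatorname{Var}(\hat{\beta}_1) + \bar{x}_2^2 \operatorname{Var}(\hat{\beta}_2) + 2 \bar{x}_2 \operatorname{Cov}(\hat{\beta}_1, \hat{\beta}_2)$

where $\bar{x}_2$ is the sample mean of $x_2$; the intercept drops out because its gradient entry is $0$.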

  6. Run some simulations to get many AME values so we can calculate their standard deviation:

ames = []
for i in range(1000):
    df, x1, x2, x1x2 = make_data()
    ame = compute_ame(df, x1, x2, x1x2)
    ames.append(ame)

np.std(ames)

  7. Compare the results:
print(np.std(ames))  # 0.13759581829
print(np.sqrt(result[0][0]))  # 0.0091748250468 from step 5

The analytic standard error is REALLY small.

What am I doing wrong here?

UPDATE (for the answer by user4422):

I separated out the variable generation from the Gaussian noise addition and re-did the simulation as advised:

def make_data():
    # preset beta values
    beta0 = 1
    beta1 = 2
    beta2 = 3
    betas = np.asarray([beta0, beta1, beta2])

    n = 500
    x1 = np.random.choice(2, size=[n, 1])  # x1 is binary!
    x2 = np.random.normal(1, 1, size=[n, 1])
    x1x2 = x1*x2
    X = np.hstack([x1, x2, x1x2])
    # COMMENTED OUT NOISING CODES
    # epsilon = np.random.normal(0, 0.1, size=[n, 1])
    y = beta0 + x1*beta1 + x1x2*beta2  # + epsilon

    data = np.hstack([y, X])
    df = pd.DataFrame(data=data, columns=['y', 'x1', 'x2', 'x1x2'])

    return df, x1, x2, x1x2

def add_noise(df):
    n = 500
    epsilon = np.random.normal(0, 0.1, size=n)
    df_noised = df.copy()
    df_noised['y'] += epsilon

    return df_noised

The for loop now changes to first generate the dataframe once and then add new noise in each iteration:

ames = []

df, x1, x2, x1x2 = make_data()  # generate variables here first

for i in range(1000):
    df_noised = add_noise(df)  # only change the noise component
    ame = compute_ame(df_noised, x1, x2, x1x2)
    ames.append(ame)

np.std(ames)

The standard deviations are similar now:

print(np.std(ames))  # 0.0091397150909881
print(np.sqrt(result[0][0]))  # 0.0091748250468 from step 5
StatsNoob
  • $$\frac{1}{n}\sum_{i=1}^n (\hat{\beta}_0 + \hat{\beta}_1 (1) + \hat{\beta}_2 (1) x_2)-(\hat{\beta}_0+\hat{\beta}_1(0)+ \hat{\beta}_2(0)x_2)$$ Could you specify a bit more precisely which formula from the source you applied here? You have these numbers in brackets $(0)$ and $(1)$; where do they come from? If I simplify it further, aren't you effectively computing the following: $$\frac{1}{n}\sum_{i=1}^n\hat{\beta}_1+\hat{\beta}_2 x_2$$ I guess that those $1$'s and $0$'s are not correctly placed. It should be the values of $x_{1j}$ (which is another thing: your expression does not use the index $i$). – Sextus Empiricus Jun 27 '22 at 06:41
  • If you are writing a reproducible example then it could be nice to expand the following expression: return np.mean( model.predict(df.assign(**{"x1":1, "x1x2":x2})) - model.predict(df.assign(**{"x1":0, "x1x2":0})) ). It is easier to follow what is happening if you create variables like beta_1 and beta_2 or prediction_1 and prediction_2 and perform the calculations with those. The expression model.predict(df.assign(**{"x1":0, "x1x2":0})) nested inside another expression is more difficult to follow. – Sextus Empiricus Jun 27 '22 at 06:52
  • Another sidenote, this is not really the Delta method. Your variable is a linear combination for which the variance can be computed exactly. The Delta method is necessary when non-linear functions are applied. – Sextus Empiricus Jun 27 '22 at 07:01
  • Thanks for the comments, I didn't see them earlier. I put the link below step 2. I took it from STATA's command documentation (page 311, equation 5). Also, my actual use case involves hundreds of such coefficients, so using the delta method formula greatly simplifies the computation. – StatsNoob Jun 28 '22 at 05:47
  • I saw the link to the document below step 2, but I can not figure out which formula from that document you have used. The formula that you wrote in your answer is confusing. You sum over $i$ from $1$ to $n$ but it does not occur in any of the terms in the sum.... – Sextus Empiricus Jun 28 '22 at 07:15
  • ... I suspect that you wanted to use the following $$\frac{1}{n} \sum_{i=1}^n (\hat{\beta}_0 + \hat{\beta}_1 (1) + \hat{\beta}_2 (1) x_{2i}) - (\hat{\beta}_0 + \hat{\beta}_1 (0) + \hat{\beta}_2 (0) x_{2i})$$ instead of $$\frac{1}{n} \sum_{i=1}^n (\hat{\beta}_0 + \hat{\beta}_1 (1) + \hat{\beta}_2 (1) x_2) - (\hat{\beta}_0 + \hat{\beta}_1 (0) + \hat{\beta}_2 (0) x_2)$$ – Sextus Empiricus Jun 28 '22 at 07:22
  • Your actual application might involve the Delta method, but the case that you describe in your question is not. – Sextus Empiricus Jun 28 '22 at 07:24

1 Answer


One thing you are doing wrong is that you do not take the square root of result: you are comparing a variance with a standard deviation.

The second thing, if I understood your code, is that in the computation of result the variance is generated only by the parameter estimation error, while in the computation of np.std(ames) the variance is generated not only by estimation error but also by sampling variability in the variables x1 and x2 (variability in the latter has a double effect in your model, as it contributes also to the average effect). To make a proper comparison you need to re-generate only the error terms epsilon (not the variables) in each simulation.

Let me provide a little more context: the standard errors computed with result are valid in a "fixed regressor" setting (either you have a small sample and the design matrix is fixed, or you have a large sample in which the statistics that depend on the design matrix converge in probability to a constant). If you run a bootstrap (which is what you call a simulation) to reproduce the standard errors of a "fixed regressor" setting, then you only need to randomly draw the errors (the regressors remain fixed). Instead, by randomly drawing also the regressors, you compute the standard errors for a "random regressor" setting. See, for example, this paper (p.254). This is not wrong, but since you want to reproduce the same result with two different methods, you need to keep the "fixed regressor" hypothesis in place.
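For illustration, here is a minimal sketch of such a fixed-design residual bootstrap (resampling only the fitted residuals while keeping the regressors fixed); it assumes the df, formula, and AME expression from the question:

model = smf.ols("y ~ x1 + x1x2", data=df).fit()
fitted = model.fittedvalues
resid = model.resid

boot_ames = []
for _ in range(1000):
    # resample residuals with replacement; the design matrix stays fixed
    y_star = fitted + np.random.choice(resid, size=len(resid), replace=True)
    m_star = smf.ols("y ~ x1 + x1x2", data=df.assign(y=y_star)).fit()
    b = m_star.params
    boot_ames.append(b["x1"] + b["x1x2"] * df["x2"].mean())

np.std(boot_ames)  # close to the analytic standard error from step 5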

Also note that, in your setting, not only "random regressors" inflate the variance of the OLS estimates, but they also inflate the variability of the average effect, which depends on the average value of one regressor.
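To see why, note that with random regressors the AME estimate $\hat{\beta}_1 + \hat{\beta}_2 \, \bar{x}_2$ also inherits the sampling variability of $\bar{x}_2$. As a rough back-of-envelope check with the numbers from the question: $\operatorname{Var}(\bar{x}_2) = 1/n = 1/500$, so the term $\beta_2^2 \operatorname{Var}(\bar{x}_2) = 9/500$ alone corresponds to a standard deviation of about $0.134$, close to the $0.1376$ obtained in the original (random regressor) simulation.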

user4422
  • I fixed the noise adding component (if you saw my previous post, there was a bug, so you can ignore my earlier comments), and the numbers agree now. – StatsNoob Jun 27 '22 at 01:24
  • It seems I need to wait 19 hours to award you with 50 points. But I have some follow up questions if you could help me: – StatsNoob Jun 27 '22 at 01:26
  • The standard errors of x1 and x1:x2 are 0.012 and 0.007, respectively. I thought the marginal effect's standard error should be bigger than either of these, but it's instead somewhere in the middle at 0.009. Do you think this makes sense? – StatsNoob Jun 27 '22 at 01:27
  • Those standard errors of x1 and x1:x2 are valid whether I resample the data points altogether or just resample the errors. Why is the marginal effect so sensitive to resampling of the data points? Or could you elaborate on what you mean by "double effect"? – StatsNoob Jun 27 '22 at 01:29
  • @StatsNoob 1. The formula for the variance of a sum of variables is $$\sigma_{X+Y}^2 = \sigma_{X}^2+ \sigma_Y^2 + 2 \rho \sigma_{X}\sigma_{Y}$$ When the correlation $\rho$ is negative, then you can end up with a smaller variance. A clear case is when $X = -Y$, i.e. they are exactly opposite. Then $X + Y = 0$ and the variance of the sum is zero. Example images can be seen in this question about MANOVA (https://stats.stackexchange.com/a/478166); the images show variables with correlation that have, for some linear sum, a smaller variance than the variance of the individual variables (see the numeric check after these comments). – Sextus Empiricus Jun 27 '22 at 07:19
  • @StatsNoob 2. You computed the variance of $\beta_1 + \beta_2 x_2$ based on the variances of $\beta_1$ and $\beta_2$ taken from a given vcov, which is computed with fixed $x_1$ and $x_2$. In your simulation you had these varied. You used vcov = model.cov_params().values, but this model is changing in your simulations. – Sextus Empiricus Jun 27 '22 at 07:26
  • @StatsNoob I changed the answer to better explain the fixed vs random regressor problem. Hope this answers the questions. – user4422 Jun 27 '22 at 08:21
  • Thank you so much for the detailed answer and the reference. I will look into it. Since I'm trying to use this in an A/B experiment setting, the fixed regressor setting is okay. I will award the 50 points in an hour as the system tells me. – StatsNoob Jun 27 '22 at 20:01
  • But one last question if you could indulge me: – StatsNoob Jun 27 '22 at 20:02
  • For the standard error estimation of individual coefficients from the regression model, both ways of bootstrapping (fixed and random regressors) give the same results, and they both agree with the regression model's standard error estimates (which is why I thought the random regressor method, which is what I was using at the start, should also yield the same result as the result variable). Could you shed some light on why, in the case of the marginal effect's standard error, the two ways of bootstrapping suddenly diverge? (Or why they should be the same in the individual coefficient case.) Thank you so much. – StatsNoob Jun 27 '22 at 20:04
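
A quick numeric check of the covariance point raised in the comments above (a sketch assuming the vcov matrix and df from the question are still in scope):

xbar = df["x2"].mean()
var_b1 = vcov[1, 1]    # Var of the x1 coefficient
var_b2 = vcov[2, 2]    # Var of the x1x2 coefficient
cov_b12 = vcov[1, 2]   # Cov of the two coefficients (negative here, as the numbers above imply)
ame_var = var_b1 + xbar**2 * var_b2 + 2 * xbar * cov_b12
print(np.sqrt(ame_var))  # identical to sqrt(grad.T @ vcov @ grad) from step 5

The negative covariance term is what allows the AME standard error to fall between the two individual coefficient standard errors.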