
I am trying to simulate the calculation of the Average Marginal Effect (AME) of a binary variable in a basic linear regression with an interaction term, and to compare the empirical standard deviation of the AME across simulations with the analytic standard error I calculate using the Delta Method. But I keep getting an analytic standard error that is way too small.

Below is my attempt, with a detailed explanation of each step:

  1. Data generation code:

I am simulating data of the form:

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_1 x_2 + \epsilon$

# import packages
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def make_data():
    # preset beta values
    beta0 = 1
    beta1 = 2
    beta2 = 3
    betas = np.asarray([beta0, beta1, beta2])

    n = 500
    x1 = np.random.choice(2, size=[n, 1])  # x1 is binary!
    x2 = np.random.normal(1, 1, size=[n, 1])
    x1x2 = x1*x2
    X = np.hstack([x1, x2, x1x2])
    epsilon = np.random.normal(0, 0.1, size=[n, 1])
    y = beta0 + x1*beta1 + x1x2*beta2 + epsilon

    data = np.hstack([y, X])
    df = pd.DataFrame(data=data, columns=['y', 'x1', 'x2', 'x1x2'])

    return df, x1, x2, x1x2

df, x1, x2, x1x2 = make_data()

  2. Calculate the AME (average marginal effect)

I calculate the AME by the difference formula for binary variables, taken from the official STATA documentation here:

$\frac{1}{n} \sum_{i=1}^n (\hat{\beta}_0 + \hat{\beta}_1 (1) + \hat{\beta}_2 (1) x_2) - (\hat{\beta}_0 + \hat{\beta}_1 (0) + \hat{\beta}_2 (0) x_2) $

def compute_ame(df, x1, x2, x1x2):
    formula = "y ~ x1 + x1x2"
    model = smf.ols(formula, data=df).fit()

    return np.mean( model.predict(df.assign(**{"x1":1, "x1x2":x2})) - model.predict(df.assign(**{"x1":0, "x1x2":0})) )

compute_ame(df, x1, x2, x1x2)

The output value is around 5, which makes sense since $\beta_1 = 2, \beta_2 = 3$, and $x_2 $ has a mean value of $1$. So from the equation, $y = \beta_0 + \beta_1 x_1 + \beta_2 x_1 x_2 + \epsilon$, turning $x_1$ on increases $y$ by 5.
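Writing the difference out, the AME reduces to

$\widehat{AME} = \hat{\beta}_1 + \hat{\beta}_2 \, \bar{x}_2 \approx 2 + 3 \cdot 1 = 5$

where $\bar{x}_2$ is the sample mean of $x_2$.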

  3. Get the covariance matrix of the model

formula = "y ~ x1 + x1x2"
model = smf.ols(formula, data=df).fit()

vcov = model.cov_params().values
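# rows/columns of vcov follow the fitted parameter order: Intercept, x1, x1x2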

  4. Get the gradient of the AME with respect to each $\beta$

$\frac{\partial}{\partial \beta} \frac{1}{n} \sum_{i=1}^n (\hat{\beta}_0 + \hat{\beta}_1 (1) + \hat{\beta}_2 (1) x_2) - (\hat{\beta}_0 + \hat{\beta}_1 (0) + \hat{\beta}_2 (0) x_2) $

$= \big[ (1-1) \enspace\enspace 1 \enspace\enspace \frac{1}{n} \sum_{i=1}^n x_2 \big]$

We have a single row vector here because we are computing the AME for only one variable, namely the binary variable $x_1$.

grad = np.asarray([0, 1, np.mean(x2)]).reshape([-1, 1])

  5. Calculate the standard error:

$(grad)^T (vcov) (grad)$

result = grad.T @ vcov @ grad
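
For this three-parameter model the quadratic form expands to

$grad^T \, vcov \, grad = \operatorname{Var}(\hat{\beta}_1) + \bar{x}_2^2 \operatorname{Var}(\hat{\beta}_2) + 2 \bar{x}_2 \operatorname{Cov}(\hat{\beta}_1, \hat{\beta}_2)$

where $\bar{x}_2$ is the sample mean of $x_2$; the intercept drops out because its gradient entry is $0$.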

  6. Run some simulations to get many AME values so we can calculate their standard deviation:

ames = []
for i in range(1000):
    df, x1, x2, x1x2 = make_data()
    ame = compute_ame(df, x1, x2, x1x2)
    ames.append(ame)

np.std(ames)

  7. Compare the results:
print(np.std(ames))  # 0.13759581829
print(np.sqrt(result[0][0]))  # 0.0091748250468 from step 5

The analytic standard error is REALLY small.

What am I doing wrong here?

UPDATE (for the answer by user4422):

I separated out the variable generation from the Gaussian noise addition and re-did the simulation as advised:

def make_data():
    # preset beta values
    beta0 = 1
    beta1 = 2
    beta2 = 3
    betas = np.asarray([beta0, beta1, beta2])

    n = 500
    x1 = np.random.choice(2, size=[n, 1])  # x1 is binary!
    x2 = np.random.normal(1, 1, size=[n, 1])
    x1x2 = x1*x2
    X = np.hstack([x1, x2, x1x2])
    # COMMENTED OUT NOISING CODES
    # epsilon = np.random.normal(0, 0.1, size=[n, 1])
    y = beta0 + x1*beta1 + x1x2*beta2  # + epsilon

    data = np.hstack([y, X])
    df = pd.DataFrame(data=data, columns=['y', 'x1', 'x2', 'x1x2'])

    return df, x1, x2, x1x2

def add_noise(df):
    n = 500
    epsilon = np.random.normal(0, 0.1, size=n)
    df_noised = df.copy()
    df_noised['y'] += epsilon

    return df_noised

The for loop now changes to first generate the dataframe once and then add new noise in each iteration:

ames = []

df, x1, x2, x1x2 = make_data()  # generate variables here first

for i in range(1000):
    df_noised = add_noise(df)  # only change the noise component
    ame = compute_ame(df_noised, x1, x2, x1x2)
    ames.append(ame)

np.std(ames)

The standard deviations are similar now:

print(np.std(ames))  # 0.0091397150909881
print(np.sqrt(result[0][0]))  # 0.0091748250468 from step 5
StatsNoob
  • $$\frac{1}{n}\sum_{i=1}^n (\hat{\beta}_0 + \hat{\beta}_1 (1) + \hat{\beta}_2 (1) x_2)-(\hat{\beta}_0+\hat{\beta}_1(0)+ \hat{\beta}_2(0)x_2)$$ Could you specify a bit more precisely which formula from the source you applied here? You have these numbers in brackets $(0)$ and $(1)$; where do they come from? If I simplify it further, aren't you effectively computing the following: $$\frac{1}{n}\sum_{i=1}^n\hat{\beta}_1+\hat{\beta}_2 x_2$$ I guess that those $1$'s and $0$'s are not correctly placed. It should be the values of $x_{1j}$ (which is another thing: your expression does not use the index $i$). – Sextus Empiricus Jun 27 '22 at 06:41
  • If you are writing a reproducible example then it could be nice to expand the following expression: return np.mean( model.predict(df.assign(**{"x1":1, "x1x2":x2})) - model.predict(df.assign(**{"x1":0, "x1x2":0})) ). It is easier to follow what is happening if you create variables like beta_1 and beta_2 or prediction_1 and prediction_2 and perform the calculations with those. The expression model.predict(df.assign(**{"x1":0, "x1x2":0})) nested inside another expression is more difficult to follow. – Sextus Empiricus Jun 27 '22 at 06:52
  • Another sidenote, this is not really the Delta method. Your variable is a linear combination for which the variance can be computed exactly. The Delta method is necessary when non-linear functions are applied. – Sextus Empiricus Jun 27 '22 at 07:01
  • Thanks for the comments, I didn't see them earlier. I put the link below step 2. I took it from STATA's command documentation (page 311, equation 5). Also, my actual use case involves hundreds of such coefficients, so using the delta method formula greatly simplifies the computation. – StatsNoob Jun 28 '22 at 05:47
  • I saw the link to the document below step 2, but I can not figure out which formula from that document you have used. The formula that you wrote in your answer is confusing. You sum over $i$ from $1$ to $n$ but it does not occur in any of the terms in the sum.... – Sextus Empiricus Jun 28 '22 at 07:15
  • ... I suspect that you wanted to use the following $$\frac{1}{n} \sum_{i=1}^n (\hat{\beta}_0 + \hat{\beta}_1 (1) + \hat{\beta}_2 (1) x_{2i}) - (\hat{\beta}_0 + \hat{\beta}_1 (0) + \hat{\beta}_2 (0) x_{2i})$$ instead of $$\frac{1}{n} \sum_{i=1}^n (\hat{\beta}_0 + \hat{\beta}_1 (1) + \hat{\beta}_2 (1) x_2) - (\hat{\beta}_0 + \hat{\beta}_1 (0) + \hat{\beta}_2 (0) x_2)$$ – Sextus Empiricus Jun 28 '22 at 07:22
  • Your actual application might involve the Delta method, but the case that you describe in your question is not. – Sextus Empiricus Jun 28 '22 at 07:24

1 Answer


One thing you are doing wrong is that you do not take the square root of result: you are comparing a variance with a standard deviation.

The second thing, if I understood your code, is that in the computation of result the variance is generated only by the parameter estimation error, while in the computation of np.std(ames) the variance is generated not only by estimation error but also by sampling variability in the variables x1 and x2 (variability in the latter has a double effect in your model, as it contributes also to the average effect). To make a proper comparison you need to re-generate only the error terms epsilon (not the variables) in each simulation.

Let me provide a little more context: the standard errors computed with result are valid in a "fixed regressor" setting (either you have a small sample and the design matrix is fixed, or you have a large sample in which the statistics that depend on the design matrix converge in probability to a constant). If you run a bootstrap (which is what you call a simulation) to reproduce the standard errors of a "fixed regressor" setting, then you only need to randomly draw the errors (the regressors remain fixed). Instead, by randomly drawing also the regressors, you compute the standard errors for a "random regressor" setting. See, for example, this paper (p.254). This is not wrong, but since you want to reproduce the same result with two different methods, you need to keep the "fixed regressor" hypothesis in place.
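For illustration, here is a minimal sketch of such a fixed-design residual bootstrap (resampling only the fitted residuals while keeping the regressors fixed); it assumes the df, formula, and AME expression from the question:

model = smf.ols("y ~ x1 + x1x2", data=df).fit()
fitted = model.fittedvalues
resid = model.resid

boot_ames = []
for _ in range(1000):
    # resample residuals with replacement; the design matrix stays fixed
    y_star = fitted + np.random.choice(resid, size=len(resid), replace=True)
    m_star = smf.ols("y ~ x1 + x1x2", data=df.assign(y=y_star)).fit()
    b = m_star.params
    boot_ames.append(b["x1"] + b["x1x2"] * df["x2"].mean())

np.std(boot_ames)  # close to the analytic standard error from step 5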

Also note that, in your setting, not only "random regressors" inflate the variance of the OLS estimates, but they also inflate the variability of the average effect, which depends on the average value of one regressor.
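To see why, note that with random regressors the AME estimate $\hat{\beta}_1 + \hat{\beta}_2 \, \bar{x}_2$ also inherits the sampling variability of $\bar{x}_2$. As a rough back-of-envelope check with the numbers from the question: $\operatorname{Var}(\bar{x}_2) = 1/n = 1/500$, so the term $\beta_2^2 \operatorname{Var}(\bar{x}_2) = 9/500$ alone corresponds to a standard deviation of about $0.134$, close to the $0.1376$ obtained in the original (random regressor) simulation.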

user4422
  • I fixed the noise adding component (if you saw my previous post, there was a bug, so you can ignore my earlier comments), and the numbers agree now. – StatsNoob Jun 27 '22 at 01:24
  • It seems I need to wait 19 hours to award you with 50 points. But I have some follow up questions if you could help me: – StatsNoob Jun 27 '22 at 01:26
  • The standard errors of x1 and x1:x2 are 0.012 and 0.007, respectively. I thought the marginal effect's standard error should be bigger than either of these, but it's instead somewhere in the middle at 0.009. Do you think this makes sense? – StatsNoob Jun 27 '22 at 01:27
  • Those standard errors of x1 and x1:x2 are valid whether I resample the data points altogether or just resample the errors. Why is the marginal effect so sensitive to resampling of the data points? Or could you elaborate on what you mean by "double effect"? – StatsNoob Jun 27 '22 at 01:29
  • @StatsNoob 1. The formula for the variance of a sum of variables is $$\sigma_{X+Y}^2 = \sigma_{X}^2+ \sigma_Y^2 + 2 \rho \sigma_{X}\sigma_{Y}$$ When the correlation $\rho$ is negative, then you can end up with a smaller variance. A clear case is when $X = -Y$, i.e. they are exactly opposite. Then $X + Y = 0$ and the variance of the sum is zero. Example images can be seen in this question about MANOVA (https://stats.stackexchange.com/a/478166); the images show variables with correlation that have, for some linear sum, a smaller variance than the variance of the individual variables (see the numeric check after these comments). – Sextus Empiricus Jun 27 '22 at 07:19
  • @StatsNoob 2. You computed the variance of $\beta_1 + \beta_2 x_2$ based on the variances of $\beta_1$ and $\beta_2$ taken from a given vcov, which is computed with fixed $x_1$ and $x_2$. In your simulation you had these varied. You used vcov = model.cov_params().values, but this model is changing in your simulations. – Sextus Empiricus Jun 27 '22 at 07:26
  • @StatsNoob I changed the answer to better explain the fixed vs random regressor problem. Hope this answers the questions. – user4422 Jun 27 '22 at 08:21
  • Thank you so much for the detailed answer and the reference. I will look into it. Since I'm trying to use this in an A/B experiment setting, the fixed regressor setting is okay. I will award the 50 points in an hour as the system tells me. – StatsNoob Jun 27 '22 at 20:01
  • But one last question if you could indulge me: – StatsNoob Jun 27 '22 at 20:02
  • For the standard error estimation of individual coefficients from the regression model, both ways of bootstrapping (fixed and random regressors) give the same results, and they both agree with the regression model's standard error estimates (which is why I thought the random regressor method, which is what I was using at the start, should also yield the same result as the result variable). Could you shed some light on why, in the case of the marginal effect's standard error, the two ways of bootstrapping suddenly diverge? (Or why they should be the same in the individual coefficient case.) Thank you so much. – StatsNoob Jun 27 '22 at 20:04
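
A quick numeric check of the covariance point raised in the comments above (a sketch assuming the vcov matrix and df from the question are still in scope):

xbar = df["x2"].mean()
var_b1 = vcov[1, 1]    # Var of the x1 coefficient
var_b2 = vcov[2, 2]    # Var of the x1x2 coefficient
cov_b12 = vcov[1, 2]   # Cov of the two coefficients (negative here, as the numbers above imply)
ame_var = var_b1 + xbar**2 * var_b2 + 2 * xbar * cov_b12
print(np.sqrt(ame_var))  # identical to sqrt(grad.T @ vcov @ grad) from step 5

The negative covariance term is what allows the AME standard error to fall between the two individual coefficient standard errors.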