
I'm trying to correct for measurement error in a regression model with multiple variables. Specifically, I want to apply the result of http://amstat.tandfonline.com/doi/abs/10.1080/01621459.1978.10480011, which requires that the measurement errors in different variables be uncorrelated.

My regression model is

$y = X_1\beta_1 + X_2\beta_2 + X_2X_3\beta_3 + \epsilon$

Both $X_1$ and $X_2$ are measured with error, but their measurement errors are uncorrelated with each other. $X_3$ is measured without error.

My intuition is that the measurement errors of $X_2$ and $X_2X_3$ are correlated, but I would like to show this either formally or via Monte Carlo simulation.

How can I do either to verify my intuition?

wwl

2 Answers


To support your intuition, suppose the measurement error is additive, so that $X_2=\xi_2 + \epsilon_2$ for a zero-mean random variable $\epsilon_2$, and otherwise $\xi_2$ and $X_3$ are uncorrelated random variables (as are $\epsilon_2$ and $X_3$). Then the measurement error of $X_2X_3$ is $\epsilon_2X_3$, whence the covariance of the two measurement errors is

$$\operatorname{Cov}(\epsilon_2,\epsilon_2 X_3)=\mathbb{E}(X_3)\operatorname{Var}(\epsilon_2).$$

This immediately shows they are correlated whenever $X_3$ has a nonzero expectation. We can compute the correlation in terms of $\sigma^2 =\operatorname{Var}(\epsilon_2)$ and moments of $X_3$ by evaluating

$$\operatorname{Var}(\epsilon_2X_3) = \operatorname{Var}(\epsilon_2)\mathbb{E}(X_3^2) = \sigma^2\mathbb{E}(X_3^2),$$

whence the correlation is

$$\rho(\epsilon_2, \epsilon_2 X_3) = \frac{\mathbb{E}(X_3)}{\sqrt{\mathbb{E}(X_3^2)}}.\tag{1}$$

This is zero if and only if $X_3$ has zero expectation.
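
For completeness, here is one way to write out the intermediate algebra behind the formulas above; this sketch assumes $\epsilon_2$ is independent of $X_3$ (so the mixed moments factor) and uses $\mathbb{E}(\epsilon_2)=0$:

$$\begin{aligned}
\operatorname{Cov}(\epsilon_2,\epsilon_2 X_3) &= \mathbb{E}(\epsilon_2^2 X_3) - \mathbb{E}(\epsilon_2)\,\mathbb{E}(\epsilon_2 X_3) = \mathbb{E}(\epsilon_2^2)\,\mathbb{E}(X_3) = \sigma^2\,\mathbb{E}(X_3),\\
\operatorname{Var}(\epsilon_2 X_3) &= \mathbb{E}(\epsilon_2^2 X_3^2) - \big[\mathbb{E}(\epsilon_2 X_3)\big]^2 = \mathbb{E}(\epsilon_2^2)\,\mathbb{E}(X_3^2) = \sigma^2\,\mathbb{E}(X_3^2),\\
\rho(\epsilon_2,\epsilon_2 X_3) &= \frac{\sigma^2\,\mathbb{E}(X_3)}{\sigma\,\sqrt{\sigma^2\,\mathbb{E}(X_3^2)}} = \frac{\mathbb{E}(X_3)}{\sqrt{\mathbb{E}(X_3^2)}}.
\end{aligned}$$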


Simulations support this result. The following draws $100$ observations of the measurement error $\epsilon_2$ from a standard Normal distribution and, independently, $100$ observations of $X_3$ from a Normal$(m,1)$ distribution where $m\in\{-2,-1,0,1,2\}$. These are paired and then the correlation of the $(\epsilon_2, \epsilon_2X_3)$ dataset is computed. This process is repeated $1000$ times, producing a distribution of sample correlation coefficients. These will only approximate the true correlation coefficients, of course. But if the theory is correct, each distribution should be spread tightly around the value given by formula $(1)$. Because the expectation of $X_3^2$ is $m^2+1$, that value is

$$\rho(\epsilon_2, \epsilon_2 X_3) = \frac{m}{\sqrt{m^2+1}}.$$

The software draws these five histograms and overplots them with vertical dotted red lines situated at this value. We check that (a) these lines are close to the middle of each histogram and (b) the histograms are fairly tightly (and unimodally) spread around the lines.

[Figure: five histograms of the simulated sample correlations, one for each $m \in \{-2,-1,0,1,2\}$, with vertical red reference lines at $m/\sqrt{m^2+1}$]

All is as expected.

Here is the R code.

n <- 1e2                # Size of each sample
means <- (-2):2         # Set of means to analyze
sim <- replicate(1e3, { # The simulation
  epsilon.2 <- rnorm(n)
  X.3 <- rnorm(n)
  sapply(means, function(m) cor(epsilon.2, epsilon.2 * (X.3+m)))
})
#
# Display the results.
#
par(mfrow=c(1,length(means)))
sapply(1:length(means), function(i) {
  hist(sim[i, ], main=paste("Mean =", means[i]), xlab="Sample correlation")
  abline(v = means[i]/sqrt(means[i]^2+1), col="Red", lty=3, lwd=2)
})
whuber

To use Monte Carlo simulation, create an Excel sheet.

I use the following notation:

  • X1 refers to the true value of the variable X1
  • X1* is the observed value of X1
  • x (lowercase) denotes multiplication

Then the Excel sheet has the following columns:

  • X1: data
  • X2: data
  • X3: data
  • error1: error in X1 (e.g. use RANDBETWEEN(-5,5))
  • error2: error in X2
  • X1*: X1 + error1
  • X2*: X2 + error2
  • X2*xX3: the product of X2* and X3
  • error2*x3: the error in the product term, i.e. X2*xX3 minus X2xX3 (which equals error2 times X3, since X3 is error-free)

One can then use the CORREL function to calculate the correlation between the error2*x3 and error2 columns.
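
If you would rather not use a spreadsheet, here is a minimal R sketch of the same computation (an illustrative translation, not the original workbook: sample() plays the role of RANDBETWEEN, and the X1/error1 columns are left out because they never enter the correlation):

set.seed(1)
n <- 1e4                                   # Number of spreadsheet rows
X2 <- sample(-10:10, n, replace = TRUE)    # True X2 values
X3 <- sample(-10:10, n, replace = TRUE)    # X3, measured without error
error2 <- sample(-5:5, n, replace = TRUE)  # Measurement error in X2
X2.star <- X2 + error2                     # Observed X2*
error2xX3 <- X2.star * X3 - X2 * X3        # Error in the product term (= error2 * X3)
cor(error2, error2xX3)                     # Analogue of CORREL
mean(X3) / sqrt(mean(X3^2))                # Plug-in value of whuber's formula (1)

Re-running with X3 <- sample(0:20, n, replace = TRUE) mirrors the second experiment below.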

I got a correlation of around 0 when X1, X2, and X3 were random integers between -10 and 10 and the error terms were random integers between -5 and 5.

However, when I adjusted X3 to be a random integer between 0 and 20, I got a correlation of around 0.9. This shows that the measurement errors are not uncorrelated in general; whuber's answer gives the formal proof.

wwl
  • I cannot reproduce your results--and I obtain a theoretical result that is quite different. – whuber Jan 12 '17 at 16:21
  • Sorry, typo. I edited my answer. – wwl Jan 12 '17 at 16:34
  • 1
    Now it makes sense: according to your edit, you get $0.9$ when $X_3$ is uniformly distributed among the integers ${0,1,\ldots, 20}$. Thus its mean is $10$ and its expected square is $410/3$, whence the correlation (according to my formula (1)) is $10/\sqrt{410/3+1}\approx 0.853$. Note that since you never use $X_1$ or $X_2$ in computing the correlation, you don't have to simulate them in your worksheet. That reduces the effort by two-thirds. – whuber Jan 12 '17 at 17:09