
I'm trying to correct for measurement error in a regression model with multiple variables. Specifically, I want to apply the result of http://amstat.tandfonline.com/doi/abs/10.1080/01621459.1978.10480011, which requires that the measurement errors in different variables be uncorrelated.

My regression model is

$y = X_1\beta_1 + X_2\beta_2 + X_2X_3\beta_3 + \epsilon$

Both $X_1$ and $X_2$ are measured with error, but their measurement errors are uncorrelated with each other. $X_3$ is measured without error.

My intuition is that the measurement errors of $X_2$ and $X_2X_3$ are correlated, but I would like to show this either formally or via Monte Carlo simulation.

How can I do either to verify my intuition?

wwl

2 Answers


To support your intuition, suppose the measurement error is additive, so that $X_2=\xi_2 + \epsilon_2$ for a zero-mean random variable $\epsilon_2$, and otherwise $\xi_2$ and $X_3$ are uncorrelated random variables (as are $\epsilon_2$ and $X_3$). Then the measurement error of $X_2X_3$ is $\epsilon_2X_3$, whence the covariance of the two measurement errors is

$$\operatorname{Cov}(\epsilon_2,\epsilon_2 X_3)=\mathbb{E}(X_3)\operatorname{Var}(\epsilon_2).$$

This immediately shows they are correlated whenever $X_3$ has a nonzero expectation. We can compute the correlation in terms of $\sigma^2 =\operatorname{Var}(\epsilon_2)$ and moments of $X_3$ by evaluating

$$\operatorname{Var}(\epsilon_2X_3) = \operatorname{Var}(\epsilon_2)\mathbb{E}(X_3^2) = \sigma^2\mathbb{E}(X_3^2),$$

whence the correlation is

$$\rho(\epsilon_2, \epsilon_2 X_3) = \frac{\mathbb{E}(X_3)}{\sqrt{\mathbb{E}(X_3^2)}}.\tag{1}$$

This is zero if and only if $X_3$ has zero expectation.
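
For completeness, here is one way to write out the intermediate algebra behind the formulas above; this sketch assumes $\epsilon_2$ is independent of $X_3$ (so the mixed moments factor) and uses $\mathbb{E}(\epsilon_2)=0$:

$$\begin{aligned}
\operatorname{Cov}(\epsilon_2,\epsilon_2 X_3) &= \mathbb{E}(\epsilon_2^2 X_3) - \mathbb{E}(\epsilon_2)\,\mathbb{E}(\epsilon_2 X_3) = \mathbb{E}(\epsilon_2^2)\,\mathbb{E}(X_3) = \sigma^2\,\mathbb{E}(X_3),\\
\operatorname{Var}(\epsilon_2 X_3) &= \mathbb{E}(\epsilon_2^2 X_3^2) - \big[\mathbb{E}(\epsilon_2 X_3)\big]^2 = \mathbb{E}(\epsilon_2^2)\,\mathbb{E}(X_3^2) = \sigma^2\,\mathbb{E}(X_3^2),\\
\rho(\epsilon_2,\epsilon_2 X_3) &= \frac{\sigma^2\,\mathbb{E}(X_3)}{\sigma\,\sqrt{\sigma^2\,\mathbb{E}(X_3^2)}} = \frac{\mathbb{E}(X_3)}{\sqrt{\mathbb{E}(X_3^2)}}.
\end{aligned}$$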


Simulations support this result. The following draws $100$ observations of the measurement error $\epsilon_2$ from a standard Normal distribution and, independently, $100$ observations of $X_3$ from a Normal$(m,1)$ distribution where $m\in\{-2,-1,0,1,2\}$. These are paired and then the correlation of the $(\epsilon_2, \epsilon_2X_3)$ dataset is computed. This process is repeated $1000$ times, producing a distribution of sample correlation coefficients. These will only approximate the true correlation coefficients, of course. But if the theory is correct, each distribution should be spread tightly around the value given by formula $(1)$. Because the expectation of $X_3^2$ is $m^2+1$, that value is

$$\rho(\epsilon_2, \epsilon_2 X_3) = \frac{m}{\sqrt{m^2+1}}.$$

The software draws these five histograms and overplots them with vertical dotted red lines situated at this value. We check that (a) these lines are close to the middle of each histogram and (b) the histograms are fairly tightly (and unimodally) spread around the lines.

[Figure: five histograms of the simulated sample correlations, one for each $m \in \{-2,-1,0,1,2\}$, with vertical red reference lines at $m/\sqrt{m^2+1}$]

All is as expected.

Here is the R code.

n <- 1e2                # Size of each sample
means <- (-2):2         # Set of means to analyze
sim <- replicate(1e3, { # The simulation
  epsilon.2 <- rnorm(n)
  X.3 <- rnorm(n)
  sapply(means, function(m) cor(epsilon.2, epsilon.2 * (X.3+m)))
})
#
# Display the results.
#
par(mfrow=c(1,length(means)))
sapply(1:length(means), function(i) {
  hist(sim[i, ], main=paste("Mean =", means[i]), xlab="Sample correlation")
  abline(v = means[i]/sqrt(means[i]^2+1), col="Red", lty=3, lwd=2)
})
whuber

To use Monte Carlo simulation, create an Excel sheet.

I use the following notation:

  • X1 refers to the true value of the variable X1
  • X1* is the observed value of X1
  • x (lowercase) denotes multiplication

Then the Excel sheet has the following columns:

  • X1: data
  • X2: data
  • X3: data
  • error1: error in X1 (e.g. use RANDBETWEEN(-5,5))
  • error2: error in X2
  • X1*: X1 + error1
  • X2*: X2 + error2
  • X2*xX3: the product of X2* and X3
  • error2*x3: the error in the product term, i.e. X2*xX3 minus X2xX3 (which equals error2 times X3, since X3 is error-free)

One can then use the CORREL function to calculate the correlation between the error2*x3 and error2 columns.
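
If you would rather not use a spreadsheet, here is a minimal R sketch of the same computation (an illustrative translation, not the original workbook: sample() plays the role of RANDBETWEEN, and the X1/error1 columns are left out because they never enter the correlation):

set.seed(1)
n <- 1e4                                   # Number of spreadsheet rows
X2 <- sample(-10:10, n, replace = TRUE)    # True X2 values
X3 <- sample(-10:10, n, replace = TRUE)    # X3, measured without error
error2 <- sample(-5:5, n, replace = TRUE)  # Measurement error in X2
X2.star <- X2 + error2                     # Observed X2*
error2xX3 <- X2.star * X3 - X2 * X3        # Error in the product term (= error2 * X3)
cor(error2, error2xX3)                     # Analogue of CORREL
mean(X3) / sqrt(mean(X3^2))                # Plug-in value of whuber's formula (1)

Re-running with X3 <- sample(0:20, n, replace = TRUE) mirrors the second experiment below.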

I got a correlation of around 0 when X1, X2, and X3 were random integers between -10 and 10 and the error terms were random integers between -5 and 5.

However, when I adjusted X3 to be a random integer between 0 and 20, I got a correlation of around 0.9. This shows that the measurement errors are not uncorrelated in general; whuber's answer gives the formal proof.

wwl
  • I cannot reproduce your results--and I obtain a theoretical result that is quite different. – whuber Jan 12 '17 at 16:21
  • Sorry, typo. I edited my answer. – wwl Jan 12 '17 at 16:34
  • 1
    Now it makes sense: according to your edit, you get $0.9$ when $X_3$ is uniformly distributed among the integers ${0,1,\ldots, 20}$. Thus its mean is $10$ and its expected square is $410/3$, whence the correlation (according to my formula (1)) is $10/\sqrt{410/3+1}\approx 0.853$. Note that since you never use $X_1$ or $X_2$ in computing the correlation, you don't have to simulate them in your worksheet. That reduces the effort by two-thirds. – whuber Jan 12 '17 at 17:09