
My question is related to this one, but that question has no answers or comments, so I am trying to ask it in a different way.

I have a physical quantity $X > 0$ and a noisy sensor $Y$ which has the following relationship with $X$:

$$Y = a \log X + b + \varepsilon, ~~~~ \varepsilon \sim \mathcal{N}(0, \sigma^2)$$

Assuming $a, b, \sigma$ are known, I want to calculate an unbiased estimate of $X$ and its variance from samples of $Y$. I have collected real measurements to test the model, and by the design of the experiment the values of $X$ are known precisely (denoted by $x$). The collected data match the following distribution, as expected:

$$ Y(x) = (Y | X=x) \sim \mathcal{N}(a \log x + b, \sigma^2) $$

Now, using measurements $\{Y_i\}$ taken at a fixed but unknown $x$, I want to estimate the value of $x$, i.e. $\hat{x} := E(X \mid \{Y_i\})$, and its variance, $\hat{\sigma}^2 := \operatorname{Var}(\hat{x})$.

Naturally, I tried

$$ \hat{x}_L := E(\log X) = \frac{E(\{Y_i\})-b}{a} ~~\text{and}~~ \hat{\sigma}_L^2 :=\operatorname{Var}(\log X) = \frac{\sigma^2}{a^2} $$

which produced pretty good results for the test data.
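For concreteness, here is a minimal simulated check of this log-domain estimator (a sketch; the values of $a$, $b$, $\sigma$ and the true $x$ are made up for illustration):

set.seed(42)
a <- 0.2; b <- 0.7; s <- 0.25          # assumed known
x <- 5                                 # fixed, treated as unknown
Y <- a*log(x) + b + rnorm(1000, sd=s)  # repeated sensor readings at this x
xL_hat <- (mean(Y) - b)/a              # estimate of log x
sL2_hat <- s^2/a^2                     # sigma_L^2 as defined above
xL_hat                                 # close to log(5) = 1.609...
sL2_hat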

Now, when estimating $X$, neither the direct estimate $e^{\hat{x}_L}$ nor the adjusted estimate $e^{\hat{x}_L + \hat{\sigma}_L^2/2}$ produced the correct result. Instead, $e^{\hat{x}_L - \hat{\sigma}_L^2/2}$ gives the correct estimate.

So, my questions are:

  1. What is the justification for the correction term $e^{- \hat{\sigma}_L^2/2}$? I have seen the term "convexity correction" but cannot find a good source for its derivation.
  2. What happens to the variance if I use this correction term?
obareey
  • Can you include your data or simulations that justify the ad hoc expression with the minus sign giving a correct estimate? – jwimberley Mar 22 '22 at 21:03

2 Answers


It's necessary here to distinguish the distribution of $X$, which is restricted to $X>0$ but is otherwise unparametrized, from the conditional distribution of individual sample values $X_i$ given the noisy measurements $Y_i$, and finally from the posterior distributions of summary statistics about $X$ (such as its mean and its variance) given the whole set of measurements $\{Y_i\}$. Because the measurement of $Y$ is noisy, the empirical variance of the data $$ \left\{ X'_i = \exp\left( \frac{Y_i-b}{a} \right) \right\} $$ will tend to be larger than the variance of $X$ itself -- the distribution of the random variable $\log X'$ is essentially the convolution of the (unknown) distribution of $\log X$ with the noise distribution of the sensor.
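For instance, a quick simulation (a sketch; the gamma test distribution and the parameter values are arbitrary) shows this inflation directly:

set.seed(1)
n <- 1000000
a <- 0.2; b <- 0.7; s <- 0.25
X <- rgamma(n, shape=2, rate=1)    # any positive test distribution will do
Y <- a*log(X) + b + rnorm(n, sd=s)
Xp <- exp((Y - b)/a)               # X' reconstructed from the noisy sensor
var(Xp)                            # substantially larger than ...
var(X)                             # ... the variance of X itself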

A hand-waving argument (certainly non-rigorous) is that, as the variance of the convolution is the sum of the original variances, it should hold that $$ \sigma^2 \left[ \log X' \right] \sim \sigma^2 \left[ \log X \right] + \sigma^2 \left[ \frac{\epsilon}{a} \right]. $$ When trying to estimate the mean of $X$ itself, the first-order correction comes from the first term on the RHS. Letting $\Sigma$ be the sample standard deviation of $Y$, the sample standard deviation of $\log X'$ is $\Sigma/a$ and so, given the assumption of normal noise, $$ \sigma^2 \left[ \log X \right] \sim \frac{\Sigma^2 - \sigma^2}{a^2}. $$ This is a formal relationship, and the right-hand side could in practice be negative (e.g. if $\Sigma$ is a sample standard deviation computed from few observations, or if the supposedly "known" value $\sigma$ is overestimated). With this grain of salt, $$ \exp \left( \mu_{\log X'} + \frac{1}{2} \sigma^2 \left[ \log X \right]\right), $$ where $\mu_{\log X'}$ is the sample mean of $\{ \log X'_i \}$, should include the first-order correction to the mean of $X$ as a function of summary statistics of $\log X$.

This is borne out in a simulation run using a test distribution for $X$ (the agreement in the simulation is artificially good since $X$ is modeled to be log-normal):

N <- 1000000
a <- 0.2
b <- 0.7
s <- 0.25
logX <- rnorm(N,mean=1,sd=1)
X <- exp(logX)
Y <- a*logX + b + rnorm(N,sd=s)
Z <- (Y-b)/a # log X with noise
sp <- s/a
# Let A = log X; c1 is the mean of A and c2, c3, ... its central moments
# Let Z = A + B, where B is mean-zero normal with standard deviation sp.
# Then mean(Z) = mean(A) = c1
c1 <- mean(Z)
# For the higher central moments of Z, let A' = A - mean(A)
# (Z-c1)^2 = A'^2 + 2A'B + B^2 => c2 + sp^2 in expectation
c2 <- var(Z) - sp**2
print(mean(X))
print(exp(c1))
print(exp(c1+c2/2))

Formally, this expression contains a term $-\sigma^2/(2a^2) = -\hat{\sigma}_L^2/2$ in the exponent, in apparent correspondence to the estimator claimed in the question to give the correct result -- in essence, this is a deconvolution term -- but it also includes a non-negative contribution proportional to the sample variance of the measurements $Y$.

Of course, if the true distribution of $\log X$ has non-negligible third and higher central moments, the above approximation may not be sufficient. It can be extended using the cumulants of the random variable $\log X$, as sketched below.
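The identity being used here is the standard cumulant expansion of the mean (assuming the moment generating function of $A = \log X$ exists at $1$): $$ E[X] = E\left[e^{A}\right] = \exp\left(\sum_{n \ge 1} \frac{\kappa_n}{n!}\right) = \exp\left(\kappa_1 + \frac{\kappa_2}{2} + \frac{\kappa_3}{6} + \frac{\kappa_4}{24} + \cdots\right), $$ where the cumulants $\kappa_n$ of $\log X$ relate to its mean and central moments by $\kappa_1 = c_1$, $\kappa_2 = c_2$, $\kappa_3 = c_3$, $\kappa_4 = c_4 - 3c_2^2$, $\kappa_5 = c_5 - 10 c_3 c_2$. The simulation below estimates the central moments of $\log X$ by subtracting the known noise contributions and then truncates this expansion at successive orders: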

N <- 1000000
a <- 0.2
b <- 0.7
s <- 0.25
X <- rexp(N,rate=5)
logX <- log(X)
Y <- a*logX + b + rnorm(N,sd=s)
Z <- (Y-b)/a # log X with noise
sp <- s/a
# Let A = log X; c1 is the mean of A and c2, c3, ... its central moments
# Let Z = A + B, where B is mean-zero normal with standard deviation sp.
# Then mean(Z) = mean(A) = c1
c1 <- mean(Z)
# For the higher central moments of Z, let A' = A - mean(A)
# (Z-c1)^2 = A'^2 + 2A'B + B^2 => c2 + sp^2 in expectation
c2 <- var(Z) - sp**2
# (Z-c1)^3 = A'^3 + 3 A'^2 B + 3 A' B^2 + B^3 => c3 in expectation
c3 <-  mean((Z-mean(Z))^3)
# (Z-c1)^4 = A'^4 + 4 A'^3 B + 6 A'^2 B^2 + 4 A' B^3 + B^4 => c4 + 6 c2 sp^2 + 3 sp^4
c4 <-  mean((Z-mean(Z))^4) - 6*c2*sp**2 -3*sp**4
# (Z-c1)^5 = A'^5 + 5 A'^4 B + 10 A'^3 B^2 + 10 A'^2 B^3 + 5 A' B^4 + B^5
# => c5 + 10 c3 sp^2
c5 <-  mean((Z-mean(Z))^5) - 10*c3*sp**2
# Cumulant expansion of expected value of X:
k1 = c1
k2 = c2
k3 = c3
k4 = c4 - 3*c2**2
k5 = c5 - 10*c3*c2
cumulant_approximations <-
c(exp(k1),
  exp(k1+k2/2),
  exp(k1+k2/2+k3/6),
  exp(k1+k2/2+k3/6+k4/24),
  exp(k1+k2/2+k3/6+k4/24+k5/120))
plot(cumulant_approximations,type="l")
abline(h=mean(X),col="red")

[Plot: successive cumulant approximations (black line) converging toward mean(X) (red line)]

jwimberley

I believe that the correction factor you first tried is used when modeling $\log Y$ as a linear function of $X$. See this thread. Here you are modeling $Y$ as a function of $\log X$, with values $x$ specified precisely as part of the experiment, to obtain a calibration curve. You then later back-transform new observed values $y_{new}$ to estimate new values $x_{new}$.

In your terminology, with $X_L = \frac{Y-b}{a}$, $X_L$ has an assumed normal distribution with variance $\sigma_L^2$. Thus $Z=\exp(X_L)$, which you use for your estimates from the calibration curve, has a log-normal distribution with parameters $\mu$ and $\sigma_L^2$ for an observation $y_{new}$. Here $\mu$ is the true mean of the $\log x_{new}$ value associated with that observation, and we assume that $\sigma_L^2$ is known. What you want is an estimate $\hat \mu$ (of $\mu$, the true $\log x_{new}$), which you then exponentiate to estimate $\hat x_{new}=\exp(\hat\mu)$.

The problem is that you are getting estimates from $Z=\exp(X_L)$. If $\mu=\log(x_{new})$ is the true value that you seek, $Z$ has an expected value $$\mathbb{E}(Z)=\exp\left(\mu + \left(\sigma_L^2/2\right)\right)=\exp(\mu)\exp(\sigma_L^2/2).$$

Thus $\exp(\mu) = \mathbb{E}(Z) \exp(-\sigma_L^2/2)$, as you found.
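A quick numeric check of that identity (a sketch; the values of $\mu$ and $\sigma_L$ are arbitrary):

set.seed(101)
mu <- log(5); sL <- 0.8
z <- exp(rnorm(1e6, mean=mu, sd=sL))  # log-normal draws with parameters mu, sL^2
mean(z)                               # ~ exp(mu + sL^2/2), biased upward
exp(mu + sL^2/2)
mean(z) * exp(-sL^2/2)                # ~ exp(mu) = 5, the bias-corrected value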

The following plot shows this bias and its correction in a simulation of a calibration experiment like yours. Values $y$ were generated from a series of specified $\log x$ values ($x$ running from 0.5 to 20) plus normally distributed error with standard deviation 1. The $y$ values were then used to predict the corresponding $x$ values, either uncorrected via $\exp(y)$ or bias-corrected via $\exp(y-0.5)$. Mean predicted values at each of the original $x$ values are shown for both estimates. The solid line is the line of identity.

I suppose that you could calculate the variance, but with the exponentiation it probably makes more sense to calculate confidence intervals. For a 95% CI, just start with what you call $x_L$, add/subtract $1.96\,\sigma_L$, and exponentiate (written out below). Note that to get the correct coverage you don't use the bias correction; see the code below. This all assumes that you actually know $\sigma_L^2$; I haven't thought through how to account for basing it on a sample estimate $\hat \sigma_L^2$.
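Written out, for a single new measurement $y_{new}$, the interval described above is $$ \left[\, \exp\!\left(\frac{y_{new}-b}{a} - 1.96\,\sigma_L\right),\ \exp\!\left(\frac{y_{new}-b}{a} + 1.96\,\sigma_L\right) \right]. $$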

[Plot: mean predicted x, uncorrected (black) and bias-corrected (red), versus true x, with the identity line]

Code for plot and CI:

xvals <- rep(seq(0.5,20,by=0.5),1000)
xvals <- xvals[order(xvals)]
xvalClass <- as.character(xvals)
set.seed(1024)
yvals <- log(xvals) + rnorm(length(xvals),mean=0,sd=1)
df1 <- data.frame(x=xvals,y=yvals,calGroup=xvalClass)
df1$calGroup <- as.factor(df1$calGroup)
df1$calGroup <- reorder(df1$calGroup,as.numeric(as.character(df1$calGroup)))
ymod <- lm(y~log(x),data=df1)
summary(ymod) ## not shown; very close to ideal
df1[,"xUncorr"] <- exp(df1$y)
df1[,"xCorr"] <- exp(df1$y-(1/2))
agg1 <- aggregate(xUncorr~calGroup,data=df1,FUN=mean)
agg2 <- aggregate(xCorr~calGroup,data=df1,FUN=mean)
dfDisplay<- data.frame(trueX=as.numeric(as.character(agg1$calGroup)),uncorr=agg1$xUncorr,corr=agg2$xCorr)
plot(uncorr~trueX,dfDisplay,bty="n",ylab="Estimated x",xlab= "True x")
abline(0,1,col="red")
points(corr~trueX,dfDisplay,col="red")
legend("topleft",legend="Black points, uncorrected\nRed points, corrected",bty="n")

For the 95% CI:

df1$xCorrL <- exp(df1$y-1.96)
df1$xCorrU <- exp(df1$y+1.96)
df1$xInCI <- df1$x > df1$xCorrL & df1$x < df1$xCorrU
mean(df1$xInCI)

[1] 0.949525

which shows the desired coverage.

EdM