Calculating standard error after a log-transform

Question

Consider a random set of numbers that are normally distributed:

x <- rnorm(n=1000, mean=10)

We'd like to know the mean and the standard error on the mean so we do the following:

se <- function(x) { sd(x)/sqrt(length(x)) }
mean(x) # something near 10.0 units
se(x)   # something near 0.03 units

Great!

However, let's assume we don't necessarily know that our original distribution follows a normal distribution. We log-transform the data and perform the same standard error calculation.

z <- log(x, base=10)
mean(z) # something near 1 log units
se(z)   # something near 0.001 log units

Cool, but now we need to back-transform to get our answer in the original units, not log units.

10^mean(z) # something near 10.0 units
10^se(z)   # something near 1.00 units

My question: Why, for a normal distribution, does the standard error differ depending on whether it was calculated from the distribution itself or if it was transformed, calculated, and back-transformed? Note: the means came out the same regardless of the transformation.

EDIT #1: Ultimately, I am interested in calculating a mean and confidence intervals for non-normally distributed data, so if you can give some guidance on how to calculate 95% CIs on transformed data, including how to back-transform them to their native units, I would appreciate it!
END EDIT #1

EDIT #2: I tried using the quantile function to get the 95% confidence intervals:

quantile(x, probs = c(0.05, 0.95))     # around [8.3, 11.6]
10^quantile(z, probs = c(0.05, 0.95))  # around [8.3, 11.6]

So, that converged on the same answer, which is good. However, using this method doesn't provide the exact same interval using non-normal data with "small" sample sizes:

t <- rlnorm(10)
mean(t)                            # around 1.46 units
10^mean(log(t, base=10))           # around 0.92 units
quantile(t, probs = c(0.05, 0.95))                     # around [0.211, 4.79]
10^(quantile(log(t, base=10), probs = c(0.05, 0.95)))  # around [0.209, 4.28]

Which method would be considered "more correct"? I assume one would pick the most conservative estimate?

As an example, would you report this result for the non-normal data (t) as having a mean of 0.92 units with a 95% confidence interval of [0.211, 4.79]?
END EDIT #2

Thanks for your time!

Thanks! I fixed that problem. The issue I am having remains though. — baffled, Nov 12 '14 at 04:11

Glen_b · Accepted Answer · 2022-03-01T23:00:52.290

Your main problem with the initial calculation is there's no good reason why $e^{\text{sd}(\log(Y))}$ should be like $\text{sd}(Y)$. It's generally quite different.

In some situations, you can compute a rough approximation of $\text{sd}(Y)$ from $\text{sd}(\log(Y))$ via Taylor expansion.

$$\text{Var}(g(X))\approx \left(g'(\mu_X)\right)^2\sigma^2_X\,.$$

If we consider $X$ to be the random variable on the log scale, then here $g(X)=\exp(X)$, so $g'(\mu_X)=\exp(\mu_X)$.

If $\text{Var}(\exp(X))\approx \exp(\mu_X)^2\sigma_X^2$,

then $\text{sd}(\exp(X))\approx \exp(\mu_X)\sigma_X$.

These notions carry across to sampling distributions.

This tends to work reasonably well if the standard deviation is really small compared to the mean, as in your example.

> mean(y)
[1] 10
> sd(y)
[1] 0.03
> lm=mean(log(y))
> ls=sd(log(y))
> exp(lm)*ls
[1] 0.0300104
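
A reproducible sketch of the same check, assuming simulated data. The seed, the sample sizes, the second (lognormal) data set, and the `delta_sd` helper name are illustrative choices, not part of the original session. It compares the Taylor approximation with the actual standard deviation when the spread is small relative to the mean and when it is not:

```r
# Delta-method approximation: sd(exp(X)) ~ exp(mean(X)) * sd(X), with X = log(Y).
# The helper name and the simulation settings are illustrative assumptions.
delta_sd <- function(y) {
  ly <- log(y)
  exp(mean(ly)) * sd(ly)
}

set.seed(1)
y_tight <- rnorm(1000, mean = 10, sd = 0.03)      # sd tiny relative to the mean
y_wide  <- rlnorm(1000, meanlog = 0, sdlog = 1)   # sd comparable to the mean

# Ratio of approximate to actual sd; a value near 1 means the approximation works
ratio_tight <- delta_sd(y_tight) / sd(y_tight)
ratio_wide  <- delta_sd(y_wide)  / sd(y_wide)
print(c(tight = ratio_tight, wide = ratio_wide))
```

The first ratio should sit very close to 1, while the second should fall well below it, which is the sense in which the approximation needs the standard deviation to be small compared to the mean.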

If you want to transform a CI for a parameter, that works by transforming the endpoints.
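
As a minimal sketch of transforming the endpoints, assuming a t-interval is reasonable on the log scale (the simulated data and variable names are illustrative): compute the interval for the mean of `log(y)`, then exponentiate both limits. Note that the back-transformed interval is for `exp(E[log Y])`, i.e. the geometric mean, not for the arithmetic mean.

```r
# t-interval for the mean of log(y), then exponentiate both endpoints.
# The result is an interval for exp(E[log Y]) (the geometric mean),
# not for the arithmetic mean E[Y].
set.seed(2)
y <- rlnorm(50, meanlog = 1, sdlog = 0.5)  # illustrative data

ly <- log(y)
n  <- length(ly)
se_log <- sd(ly) / sqrt(n)
ci_log <- mean(ly) + c(-1, 1) * qt(0.975, df = n - 1) * se_log

ci_geo   <- exp(ci_log)    # back-transformed endpoints
geo_mean <- exp(mean(ly))  # back-transformed point estimate
print(ci_geo)
print(geo_mean)
```

With base-10 logs, as in the question, the same idea applies with `10^` in place of `exp()`. The resulting interval is not symmetric about the point estimate, which is expected after a nonlinear back-transformation.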

If you're trying to transform back to obtain a point estimate and interval for the mean on the original (unlogged) scale, you will also want to unbias the estimate of the mean (see the above link): $E(\exp(X))\approx \exp(\mu_X)\cdot (1+\sigma_X^2/2)$, so a (very) rough large-sample interval for the mean might be $(c\cdot\exp(L),\,c\cdot\exp(U))$, where $L,U$ are the lower and upper limits of a log-scale interval, and $c$ is some consistent estimate of $1+\sigma_X^2/2$.
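
A rough sketch of that bias-adjusted interval, assuming a normal-theory interval on the log scale and the plug-in estimate `1 + var(x)/2` for the correction factor (the data, seed, and names such as `c_hat` are illustrative). It treats the correction factor as fixed, so it ignores the sampling variability in that factor:

```r
# Rough large-sample interval for the mean on the original scale:
# (c * exp(L), c * exp(U)), with c estimating 1 + sigma_X^2 / 2.
set.seed(3)
y <- rlnorm(2000, meanlog = 0, sdlog = 0.3)  # illustrative data; true mean is exp(0.045)

x <- log(y)
n <- length(x)
L <- mean(x) - qnorm(0.975) * sd(x) / sqrt(n)  # lower log-scale limit
U <- mean(x) + qnorm(0.975) * sd(x) / sqrt(n)  # upper log-scale limit

c_hat   <- 1 + var(x) / 2          # plug-in estimate of the correction factor
ci_mean <- c_hat * c(exp(L), exp(U))
print(ci_mean)
print(mean(y))  # sample mean, for comparison
```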

If your data are approximately normal on the log scale, you may want to treat it as a problem of producing an interval for a lognormal mean.
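
One approach of that kind, sketched here as an assumption rather than as what the answer necessarily had in mind, is the approximation commonly referred to as Cox's method: estimate $\log E(Y)=\mu_X+\sigma_X^2/2$ by $\bar{x}+s^2/2$, attach an approximate standard error of $\sqrt{s^2/n+s^4/(2(n-1))}$, and exponentiate the limits. The function name `lognormal_mean_ci` and the simulated data are hypothetical.

```r
# Approximate CI for a lognormal mean E[Y] = exp(mu + sigma^2 / 2),
# using the approach commonly referred to as Cox's method.
# Assumes log(y) is approximately normal; the function name is hypothetical.
lognormal_mean_ci <- function(y, level = 0.95) {
  x  <- log(y)
  n  <- length(x)
  s2 <- var(x)
  est <- mean(x) + s2 / 2                      # estimate of log E[Y]
  se  <- sqrt(s2 / n + s2^2 / (2 * (n - 1)))   # approximate s.e. of est
  z   <- qnorm(1 - (1 - level) / 2)
  exp(c(lower = est - z * se, estimate = est, upper = est + z * se))
}

set.seed(4)
y  <- rlnorm(200, meanlog = 0, sdlog = 1)  # illustrative data; true mean is exp(0.5)
ci <- lognormal_mean_ci(y)
print(ci)
```

This relies on the lognormal assumption more heavily than simply transforming endpoints does, so it is worth checking a normal Q-Q plot of `log(y)` before trusting it.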

(There are other approaches to unbiasing mean estimates across transformations; e.g. see Duan, N., 1983. Smearing estimate: a nonparametric retransformation method. JASA, 78, 605-610)

I don't have the reputation to comment, but just in case another curious soul happens across this post, from looking at the Taylor expansion link on Wikipedia, the correct estimate for the mean should be $$\begin{eqnarray}\text{E}[f(X)] &\approx& f(\mu_X)+\frac{f^{\prime\prime}(\mu_X)}{2}\sigma_X^2\\ &=& \exp(\mu_X)\left(1 +\frac{\sigma_X^2}{2}\right) \end{eqnarray}$$ Otherwise, if $\exp(\mu_X)\gg\sigma_X^2$, you might underestimate $\text{E}[\exp(X)]$ — deasmhumnha, Dec 19 '16 at 18:13
Thanks @Dezmond. Yes, that's correct. I'll add in a correction to my answer, that part of it near the end is quite mangled. — Glen_b, Dec 19 '16 at 22:48

score 1 · Answer 2 · answered Jan 11 '17 at 00:32

It sounds like you effectively want the geometric standard error, akin to the geometric mean exp(mean(log(x))).

While it might seem reasonable to compute that as:

n <- length(x)
exp(sd(log(x)) / sqrt(n - 1))

You and others have already pointed out that that isn't correct for a few reasons. Instead, use:

exp(mean(log(x))) * (sd(log(x))/sqrt(n-1))

Which is the geometric mean multiplied by the log-standard error. This should approximate the "natural" standard error pretty well.
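
A small sketch comparing the two expressions on data like the question's (the seed is an arbitrary choice, and this version divides by `sqrt(n)` as in the question's `se()` rather than `sqrt(n - 1)`, which makes a negligible difference at this sample size):

```r
# Compare the naive back-transform exp(se(log x)) with the
# geometric-mean-times-log-SE version, using data like the question's.
se <- function(x) sd(x) / sqrt(length(x))

set.seed(5)
x <- rnorm(n = 1000, mean = 10)

naive    <- exp(se(log(x)))                 # near 1: a multiplicative factor, not in units of x
delta_se <- exp(mean(log(x))) * se(log(x))  # in units of x, close to se(x)
print(c(se = se(x), naive = naive, delta = delta_se))
```

The naive value is close to 1 because exponentiating a small log-scale standard error gives a ratio rather than a quantity in the original units, which is why it does not resemble `se(x)`.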

Source: https://www.jstor.org/stable/pdf/2235723.pdf
