
What I am trying to do is fit a log-normal distribution to a dataset and then determine confidence and prediction intervals for the fitted distribution - not just for the parameter estimates.

My final goal is to be able to say that if we repeated a set of measurements, then 95% of the values would fall below some specific value. I can obviously do that from the fitted distribution itself (or rather the cumulative version), but I'm thinking that will underestimate the probability, because I need to include the variability in the estimates and/or fit itself first?

If I use the following code:

require(MASS)
set.seed(123)
x <- rlnorm(100)
fit <- fitdistr(x, "lognormal")

then R will fit a log-normal distribution to my data. The fitdistr function returns the estimated meanlog and sdlog (the mean and standard deviation on the log scale), along with standard errors for these estimates:

   meanlog       sdlog   
 0.09040591   0.90824033 
(0.09082403) (0.06422229)
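For concreteness, the plug-in cut-off described above - a single quantile of the fitted distribution, which treats these estimates as if they were known exactly - would be a sketch like:

```r
require(MASS)
set.seed(123)
x <- rlnorm(100)
fit <- fitdistr(x, "lognormal")

# Naive plug-in 95th percentile of the fitted distribution; this treats
# the estimated meanlog/sdlog as known, so it ignores the standard errors.
cutoff <- qlnorm(0.95, fit$estimate[["meanlog"]], fit$estimate[["sdlog"]])
cutoff
```

This is exactly the quantity the question argues will be optimistic, because it ignores the standard errors shown above.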

I understand that these will then allow me to plot the fitted distribution (and histogram) using ggplot2 with the following code:

meanlog <- fit$estimate[[1]]
sdlog <- fit$estimate[[2]]
binwidth <- abs(max(x) - min(x)) / 20

qplot(x, geom = "blank") +
  geom_histogram(binwidth = binwidth, aes(y = ..density..)) +
  stat_function(fun = dlnorm, args = list(meanlog = meanlog, sdlog = sdlog),
                colour = "red")

However, what I really want to do is plot the confidence interval and/or prediction interval of this fitted distribution, similar to what ggplot2 produces with stat_smooth:

x <- seq(1, 100)
y <- x + rnorm(x, sd = 10)
qplot(x, y, geom = "point") + stat_smooth(method = "lm", se = TRUE)

I can use confint(fit) to extract confidence intervals for the estimated meanlog and sdlog, but I think I'm misunderstanding the maths, because I can't for the life of me work out how to use those to calculate a confidence interval for the distribution itself - let alone a prediction interval. I've tried writing my own function for the log-normal distribution and feeding it various combinations of the limits from confint manually, but that doesn't work: a confidence interval for the parameter estimates does not directly give you a confidence interval for the fitted curve, and therefore not the prediction interval either.

I would really appreciate anyone who can walk me through this please!

Mooks
  • The obvious approach is to form a prediction interval on the log scale (a normal PI) and exponentiate the endpoints. – Glen_b Jul 25 '14 at 15:42
  • Yeah, I feel a bit dopey for not realising that first. In my defence, it was at the end of a long day! – Mooks Jul 25 '14 at 19:14
  • Mooks: it works for prediction intervals, but doesn't work (as is) for a CI for the mean. – Glen_b Jul 26 '14 at 00:49
  • That's ok - I'm mainly interested in the prediction interval because the idea is to be able to say - if I repeat this measurement then I can say that 95 % (or whatever cut-off I choose) of values will be below some x. Thanks again. – Mooks Jul 28 '14 at 07:16
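A sketch of why the back-transformation in the comments works for prediction intervals but not for a confidence interval of the mean: exponentiating log-scale quantities targets the lognormal median, exp(mu), whereas the lognormal mean is exp(mu + sigma^2/2). Using roughly the fitted values from the question:

```r
# Back-transforming log-scale quantities targets the lognormal median,
# exp(mu); the lognormal mean is exp(mu + sigma^2 / 2), which is larger.
mu <- 0.09      # roughly the fitted meanlog above
sigma <- 0.91   # roughly the fitted sdlog above
med <- exp(mu)                 # lognormal median
mn  <- exp(mu + sigma^2 / 2)   # lognormal mean
c(median = med, mean = mn)
```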

2 Answers


Here is one simple approach:

> x.logmod = lm(log(x) ~ 1)
> exp(predict(x.logmod, newdata = data.frame(junk = 0), interval = "predict"))
       fit       lwr      upr
1 1.094619 0.1773106 6.757576

The linear model obtains the mean of $\log(x)$. The predict statement can compute a prediction interval for a new dataset, so if we un-transform it, we get a prediction interval for $x$ itself. The newdata argument may be skipped if you want 100 copies of the same interval! Instead, I provided a dataset that has just one row; since we are predicting the intercept, it doesn't matter what's in it.
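For reference, the same interval can be computed by hand with the standard normal-theory formula for a single new observation, which makes the back-transformation step explicit (a sketch using the question's simulated x):

```r
set.seed(123)
x  <- rlnorm(100)
lx <- log(x)
n  <- length(lx)

# 95% prediction interval for a new log-observation:
# mean(lx) +/- t * sd(lx) * sqrt(1 + 1/n), then exponentiate the endpoints
tq <- qt(0.975, df = n - 1)
pi_log <- mean(lx) + c(-1, 1) * tq * sd(lx) * sqrt(1 + 1/n)
exp(pi_log)
```

This reproduces the lwr/upr columns that predict() returns for the intercept-only model.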

Russ Lenth
  • Thanks very much for that - embarrassingly, I never thought of transforming the data first! – Mooks Jul 25 '14 at 19:10

Here is a less orthodox answer to your question, based on bootstrap confidence intervals for the cumulative distribution function. The idea is as follows:

  1. Simulate N samples of size n (same size as the original sample) from the fitted distribution.
  2. Calculate the MLE for each of the simulated samples.
  3. Calculate the estimated CDF for each of these MLEs.
  4. Obtain a bootstrap confidence interval using these samples (you have a variety of options for this).
  5. Repeat for a grid of points to obtain a bootstrap confidence band for the CDF.

The following R code implements this idea (it is a bit slow and it can be made more efficient with a bit of effort).

require(MASS)
set.seed(123)
x <- rlnorm(100)
fit <- fitdistr(x, "lognormal")

# N = number of simulations
# n = original sample size
# q = quantile or point of interest
# l = lower limit of the confidence interval, e.g. 2.5%
# u = upper limit of the confidence interval, e.g. 97.5%

# Bootstrap confidence interval
C.I.F <- function(N, n, q, l, u) {
  CDF <- vector()
  for (i in 1:N) {
    sim <- rlnorm(n, fit$estimate[1], fit$estimate[2])
    fittemp <- fitdistr(sim, "lognormal")
    CDF[i] <- plnorm(q, fittemp$estimate[1], fittemp$estimate[2])
  }
  return(c(quantile(CDF, l), quantile(CDF, u)))
}

# Example with a single point
C.I.F(1000,100,0.5,0.025,0.975)

# Confidence bands (renamed from `points` to avoid shadowing the points() function)
band <- matrix(0, ncol = 2, nrow = 100)
xp <- seq(0.1, 8, length.out = 100)

for (i in 1:100) band[i, ] <- C.I.F(1000, 100, xp[i], 0.025, 0.975)
fitcdf <- function(q) plnorm(q, fit$estimate[1], fit$estimate[2])

# Plot the fitted CDF with the bootstrap bands
curve(fitcdf, 0, 8, lwd = 2)
points(xp, band[, 1], type = "l", col = "red", lwd = 2)
points(xp, band[, 2], type = "l", col = "red", lwd = 2)
  • Thanks to you too. I think I will need to go away and digest this answer a bit, I really appreciate it though as I think I'll learn a few new tricks with this. – Mooks Jul 25 '14 at 19:12