Function fit to skewed data and non-zero beginning of the function

Question

I would like to find a function that best fits this type of biological data. More precisely, I would like to estimate expected daily egg production by an insect, based on experimental data derived from a number of individuals. It is not a frequency distribution, but I could use the form of a probability density function such as the Gamma or log-normal. Any suggestions? The purpose is to use this function to predict the expected daily egg production of an insect during its lifespan (defined as physiological age on the x-axis).

Below is my attempt (in red) at fitting the Gamma PDF:

gamma_fun <- function(x, a, b){(x^(a-1))*exp(-x/b)}

The problem is that the Gamma distribution function starts at 0, and my data does not.

The question cannot be answered without knowing your ultimate goal. Please edit the post. — Frank Harrell, Nov 01 '21 at 12:54
Better; please add why you think fitting a curve would allow you to better estimate the mean. — Frank Harrell, Nov 01 '21 at 13:09
What exactly is being measured and represented on this plot? It is labeled "mean daily production," suggesting that multiple observations are being averaged. For your prediction purposes, wouldn't it be more relevant to track the egg-laying history of individual subjects? — whuber, Nov 01 '21 at 13:44
Actually, it is daily egg production - data taken from a number of individual insects. I will correct labelling the y axis shortly. — MIH, Nov 01 '21 at 15:16
The issue is the nature of your question. You write that you want to "estimate expected daily egg production by an insect." That's not the same as estimating averages over all insects. — whuber, Nov 01 '21 at 15:38
You seem to be mixing up time series modeling (production of eggs as a function of day) with the two-dimensional plot of a Gamma distribution PDF. There is no reason why you can't use the equation of a gamma PDF as a regression model, but then that equation just happens to be of the same form as a PDF. This reminds me of an answer when we were discussing the COVID curve last year; please see the first point in the answer by Alexis. — Dave, Nov 01 '21 at 15:45

BruceET · Answer 1 · 2021-11-02T00:33:53.810

I'm not sure that trying to give a name to the distribution of eggs per day for such an insect is the most fruitful approach.

Suppose you have $n = 100$ fictitious observations in vector x in R, with summary information as follows:

summary(x); length(x); sd(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  8.723  15.583  19.227  19.955  24.130  33.040 
[1] 100
[1] 5.532306
hist(x, prob=T, col="skyblue2");  rug(x)

So a point estimate of the population mean $\mu$ is $\bar X = 19.955$ and the point estimate of the population standard deviation is $S = 5.532.$ The shape of the histogram suggests that the population distribution is mildly right skewed and thus not normal. So a 95% t confidence interval might not give the best idea of how good the estimate of $\mu$ is. Just for reference, that interval is $\bar X \pm t^*S/\sqrt{100},$ where $t^*$ cuts probability $0.025$ from the upper tail of Student's t distribution with $\nu = n-1 = 99$ degrees of freedom. From R, this 95% CI computes to $(18.86, 21.05).$

t.test(x)$conf.int
[1] 18.85748 21.05294
attr(,"conf.level")
[1] 0.95
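The same interval can be reproduced directly from the formula $\bar X \pm t^*S/\sqrt{n}$. The sketch below regenerates the fictitious data with the seed and `rgamma` call given in Note (2); the names `t.star` and `ci.t` are only illustrative.

```r
# Recreate the fictitious data (same seed and call as in Note 2)
set.seed(2010)
x = rgamma(100, 10, .5)

n = length(x)
t.star = qt(0.975, df = n - 1)   # cuts probability 0.025 from the upper tail
ci.t = mean(x) + c(-1, 1) * t.star * sd(x) / sqrt(n)
ci.t                             # should agree with t.test(x)$conf.int
```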

One style of 95% nonparametric bootstrap confidence interval that works well for moderately skewed data is illustrated below. The bootstrap repeatedly takes re-samples of size 100 from x with replacement. By finding the means of these re-samples and seeing how far they lie from the observed $\bar X$ of the data (denoted as a.obs in the program below), we can get an idea of the variability of sample means, and thus make an approximate 95% nonparametric CI, $(18.84, 21.07),$ without assuming data are normal (or that data take any other particular distribution). (We do assume that the population has a mean.)

set.seed(1101)
a.obs = mean(x)
d.re = replicate(2000, mean(sample(x,100,rep=T)) - a.obs)
UL = quantile(d.re, c(.975,.025))
a.obs - UL
   97.5%     2.5% 
18.83724 21.06752
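The steps above can be wrapped in a small reusable function. This is only a sketch: the function name `basic_boot_ci` and its arguments `B` and `level` are hypothetical, not part of the code above, and the data are regenerated as in Note (2) so that the block runs on its own.

```r
# Bootstrap CI for the mean, based on deviations of re-sample means
# from the observed mean (the same construction as above).
basic_boot_ci = function(x, B = 2000, level = 0.95) {
  a.obs = mean(x)
  n = length(x)
  # deviations of re-sample means from the observed mean
  d.re = replicate(B, mean(sample(x, n, replace = TRUE)) - a.obs)
  alpha = 1 - level
  UL = quantile(d.re, c(1 - alpha/2, alpha/2), names = FALSE)
  a.obs - UL                     # c(lower, upper)
}

set.seed(2010)
x = rgamma(100, 10, .5)          # fictitious data, as in Note 2
set.seed(1101)
ci.boot = basic_boot_ci(x)
ci.boot
```

Subtracting the upper quantile of the deviations gives the lower limit and vice versa, which is why the quantiles are requested in the order (0.975, 0.025).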

Of course, in this case, there is not much difference between the questionable t CI and the more generally applicable bootstrap CI. The issue is that one can never quite know for sure how well the t interval will work for noticeably non-normal data. [It can be risky to rely on the often quoted, but not successfully defended, "rule of 30." Some authors seem to claim that t CIs are guaranteed reliable if based on a sample of size 30 or more.]

Notes: (1) If you were sure that your data are gamma distributed, then you might use a confidence interval derived specifically for gamma data.

(2) My fictitious data x were sampled in R as shown below,

set.seed(2010)
x = rgamma(100, 10, .5)
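As a rough illustration of Note (1), one way to use a gamma assumption is a parametric bootstrap: fit a gamma distribution to x, simulate samples from the fitted distribution, and proceed as in the nonparametric case. This is a sketch under stated assumptions, not necessarily the gamma-specific interval meant in Note (1); the method-of-moments fit and the names `shape.hat`, `rate.hat`, `d.par`, and `ci.gamma` are choices made here for illustration.

```r
set.seed(2010)
x = rgamma(100, 10, .5)          # fictitious data, as above
n = length(x)

# Method-of-moments estimates (an assumption; maximum likelihood is another option).
# The fitted mean shape.hat/rate.hat equals mean(x) by construction.
shape.hat = mean(x)^2 / var(x)
rate.hat  = mean(x) / var(x)

set.seed(1101)
a.obs = mean(x)
# deviations of simulated sample means from the observed (= fitted) mean
d.par = replicate(2000, mean(rgamma(n, shape.hat, rate.hat)) - a.obs)
ci.gamma = a.obs - quantile(d.par, c(.975, .025), names = FALSE)
ci.gamma                         # approximate 95% CI for the mean, assuming gamma data
```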
