
I have some function that, according to the literature, should approximate a log-normal distribution. To fit it, I am using a hacky (yet legitimate) approach to fit pre-binned data:

```python
import numpy as np
from scipy import stats

# counts are the bin heights; bin_centers are positions on the x axis
counts, bin_centers = np.array(distr), np.linspace(0, len(distr), len(distr))

# restore data from hist: repeat each bin center by its count
restored = [[d] * int(counts[n]) for n, d in enumerate(bin_centers)]

# flatten result
restored = np.hstack(restored)

dist = stats.lognorm(*stats.lognorm.fit(restored))
x = np.arange(0, max(bin_centers))
y = dist.pdf(x)

# PDF is normalized, so scale it to match hist
y = y / y.max()
y = y * counts.max()
```

This almost works, but the modeled fit seems to be off. Is this a consequence of my data potentially not being log-normal to begin with? Or is some better approach possible?

[figure: histogram of the data with the fitted lognormal PDF overlaid]

  • You have a lot of zeros on the right end. This is absolutely not lognormal, and trying to fit these will throw off the fit in other parts of the distribution, too. Can you restrict your range to the domain of $\theta$ with positive mass? – Stephan Kolassa Feb 15 '23 at 09:17
  • Something is seriously wrong with this, because on a logarithmic scale of $\theta$ the distribution should look Normal -- which is not skewed. For a correct approach to fitting Lognormal parameters to binned data, see https://stats.stackexchange.com/a/56100/919 for instance. – whuber Feb 15 '23 at 14:25
  • For slightly I would read very. – Nick Cox Feb 15 '23 at 15:41
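The binned-fit approach whuber links can be sketched by maximizing the multinomial likelihood of the bin counts directly, using the lognormal CDF to compute each bin's probability mass. This is a sketch, not the linked answer's exact code; `fit_lognorm_binned` is a hypothetical helper, and the simulated data stand in for the real histogram:

```python
import numpy as np
from scipy import stats, optimize

def fit_lognorm_binned(edges, counts):
    """Fit a two-parameter lognormal (s, scale) to binned counts by multinomial MLE."""
    def nll(params):
        s, scale = np.exp(params)      # optimize on log scale so both stay positive
        cdf = stats.lognorm.cdf(edges, s, scale=scale)
        p = np.diff(cdf)               # probability mass in each bin
        p = np.clip(p, 1e-12, None)    # guard against log(0) in empty tails
        return -np.sum(counts * np.log(p))
    res = optimize.minimize(nll, x0=[0.0, 0.0], method="Nelder-Mead")
    return np.exp(res.x)               # (s, scale)

# quick check on simulated, pre-binned lognormal data
rng = np.random.default_rng(0)
sample = stats.lognorm.rvs(0.5, scale=3.0, size=20_000, random_state=rng)
counts, edges = np.histogram(sample, bins=50)
s_hat, scale_hat = fit_lognorm_binned(edges, counts)
```

Unlike the midpoint-repetition hack, this uses the exact probability of landing in each bin, so no within-bin position has to be invented.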

1 Answer


I can't read this code, but it looks like you're binning these data and then using the approach from the linked post to calculate the fit. Whatever lack of precision you may have in the full data will only be worse when the data are aggregated this way. Unless your $n$ is ginormous, we'd expect the "data" to be off a few ticks from the "fit".

I agree the method is hacky, but I do not agree that it is legitimate. You have imputed the bin midpoint as the observed value for every observation in that bin. But the data within a bin are not symmetrically distributed about its midpoint, so the actual mean response conditional on falling in a particular bin is different from the midpoint. There are legitimate, yet sophisticated, ways to impute a binned response using the EM algorithm which yield efficient and unbiased estimates of the model parameters.
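The midpoint bias can be seen concretely by comparing a bin's midpoint with the lognormal's conditional mean inside that bin (a sketch with arbitrary illustrative parameters, not the question's actual data):

```python
import numpy as np
from scipy import stats, integrate

# illustrative lognormal: shape s = 0.9, scale = exp(mu) = 1
dist = stats.lognorm(0.9)

a, b = 2.0, 4.0                  # one bin out in the right tail
midpoint = (a + b) / 2

# E[X | a < X < b] = (integral of x f(x) over the bin) / P(a < X < b)
num, _ = integrate.quad(lambda x: x * dist.pdf(x), a, b)
mass = dist.cdf(b) - dist.cdf(a)
cond_mean = num / mass
```

Because the density is falling across this bin, mass piles up near the left edge and `cond_mean` comes out below `midpoint`; midpoint imputation therefore systematically overstates tail observations.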

That seems to be beside the point! If the data are not actually binned, then why not just use maximum likelihood estimation? I would assume this is exactly what `lognorm.fit` is doing in Python.
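A minimal sketch of that direct route, assuming the raw (unbinned) sample is available: fit by MLE with `lognorm.fit`, then overlay the PDF scaled by sample size times bin width, rather than by matching peak heights as in the question. The simulated sample and its parameters are illustrative stand-ins:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = stats.lognorm.rvs(0.6, scale=2.0, size=5_000, random_state=rng)

# MLE fit; floc=0 pins the location at zero, giving the usual
# two-parameter lognormal instead of scipy's shifted three-parameter form
s, loc, scale = stats.lognorm.fit(sample, floc=0)
dist = stats.lognorm(s, loc, scale)

counts, edges = np.histogram(sample, bins=40)
width = edges[1] - edges[0]
x = np.linspace(edges[0], edges[-1], 200)

# expected count per bin = n * bin_width * pdf(x),
# which puts the curve on the same scale as the histogram heights
y = len(sample) * width * dist.pdf(x)
```

Normalizing by `n * width` instead of `y.max()` keeps the curve correctly scaled even when the empirical peak is noisy.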

AdamO