
In the GBM package, we can specify the distribution that represents our response variable. I have count data, and for count data we usually specify the Poisson distribution. But when I check my data, they are not Poisson distributed.

Is it okay to specify the distribution in GBM as Poisson even though my data are not Poisson distributed?

  • Just to note that a Poisson with a sufficiently large $\lambda$ (e.g. 50+) will look quite similar to a Gaussian. – usεr11852 Mar 22 '22 at 10:02
  • @usεr11852 I fit a Poisson to my data; the λ is 0.05, but the data are overdispersed. So is it still okay to use Poisson for gbm? – Dhestar Bagus Wirawan Mar 22 '22 at 10:47
  • Yeah, ultimately Poisson will be our loss function. What we also care about is our evaluation criterion (MAE, I suppose?), so ultimately it comes down to the validation scheme used and how well the evaluation metric aligns with our use of the model. – usεr11852 Mar 22 '22 at 13:24

1 Answer


Yes, it is OK to use the Poisson family in a GBM even if the count data are over-dispersed. Ultimately, Poisson will simply be our loss function. What we primarily care about is our evaluation metric (MAE in the first instance, I suppose?), so it comes down to the validation scheme used and how well the evaluation metric aligns with our intended use of the model. (CV.SE has a thread at https://stats.stackexchange.com/questions/379264/ for anyone who wants to explore this distinction further.)
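To make the loss-versus-metric distinction concrete, here is a minimal sketch in plain Python. It is not the gbm package's implementation; the function names and the toy counts are hypothetical and chosen only for illustration. It compares the mean Poisson deviance with MAE for two constant predictions on sparse count data.

```python
import math

def poisson_deviance(y_true, y_pred):
    """Mean Poisson deviance: 2 * mean(y * log(y / mu) - (y - mu)).

    The term y * log(y / mu) is taken as 0 when y == 0.
    """
    total = 0.0
    for y, mu in zip(y_true, y_pred):
        if mu <= 0:
            raise ValueError("predicted means must be strictly positive")
        log_term = y * math.log(y / mu) if y > 0 else 0.0
        total += log_term - (y - mu)
    return 2.0 * total / len(y_true)

def mean_absolute_error(y_true, y_pred):
    """Plain MAE between observed counts and predicted means."""
    return sum(abs(y - mu) for y, mu in zip(y_true, y_pred)) / len(y_true)

# Hypothetical toy counts (mostly zeros), not the asker's data.
y = [0, 0, 0, 1, 0, 0, 3, 0, 0, 0]
pred_mean = [sum(y) / len(y)] * len(y)  # constant prediction at the sample mean
pred_zero = [1e-6] * len(y)             # constant prediction close to the median (0)

for name, pred in [("mean", pred_mean), ("near-zero", pred_zero)]:
    print(name, poisson_deviance(y, pred), mean_absolute_error(y, pred))
```

On data like these, the Poisson deviance favours the prediction at the mean while MAE favours the near-zero prediction, because MAE is minimised by the median. That is the sense in which the training loss and the evaluation metric need to be checked against each other.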

I would also note here that, as $\lambda$ is quite low (~$0.05$), which suggests the potential for zero inflation, it makes sense to consider a hurdle and/or a zero-inflated model too.
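A quick way to gauge whether zero inflation is plausible is to compare the observed share of zeros with the share a Poisson with the same mean would imply, $e^{-\lambda}$, alongside the variance-to-mean ratio. The sketch below is plain Python with hypothetical helper names and made-up counts (constructed to have mean 0.05); it is a rough diagnostic under those assumptions, not a formal test.

```python
import math

def dispersion_index(counts):
    """Sample variance divided by sample mean; about 1 for Poisson data."""
    n = len(counts)
    mean = sum(counts) / n
    var = sum((c - mean) ** 2 for c in counts) / (n - 1)
    return var / mean

def zero_fractions(counts):
    """Return (observed share of zeros, share expected under Poisson(mean))."""
    n = len(counts)
    lam = sum(counts) / n
    observed = sum(1 for c in counts if c == 0) / n
    expected = math.exp(-lam)
    return observed, expected

# Hypothetical counts: 1000 observations with mean 0.05, not the asker's data.
counts = [0] * 980 + [2] * 10 + [3] * 10

observed, expected = zero_fractions(counts)
print("dispersion index:", dispersion_index(counts))
print("observed zeros:", observed, "expected under Poisson:", expected)
```

A dispersion index well above 1 together with more zeros than $e^{-\lambda}$ predicts would be consistent with the over-dispersion described in the question and would motivate trying a hurdle or zero-inflated alternative.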

usεr11852