
I have data that is $\mathcal{NB}$-distributed, with only ~100 data points in one sample. I have tried to fit a Negative Binomial to the data, but first I decided to run simulations:

library(MASS)
x <- 5; y <- 20                               # example true values for size and mu
vect <- rnbinom(100, size = x, mu = y)        # simulate 100 NB observations
fitdistr(vect, densfun = "negative binomial")

The estimate of $\mu$ is quite accurate, but the estimate of $size$ varies by roughly ±50 across simulations (I can provide the exact plot).
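
For reference, a minimal sketch of the kind of repeated simulation I ran (the true values size = 5, mu = 20 and the 1000 replicates are example choices of mine):

library(MASS)
# repeat the fit many times to see how widely each estimate varies
ests <- replicate(1000, {
  v <- rnbinom(100, size = 5, mu = 20)        # same example values as above
  fitdistr(v, densfun = "negative binomial")$estimate
})
apply(ests, 1, sd)                            # spread of the size and mu estimates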

Which is better: use the NB with a poorly estimated $size$ (it does not change the shape of the distribution much), or apply a sqrt/log/Box-Cox transform and assume normality? I could winsorize under the normal model.
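
To make the alternative concrete, this is roughly what I mean by the transform-and-winsorize route (the log1p transform and the 5%/95% cutoffs are arbitrary choices of mine):

library(MASS)                                 # for fitdistr
z  <- log1p(vect)                             # vect from above; log(1 + x) handles zero counts
lo <- quantile(z, 0.05)
hi <- quantile(z, 0.95)
zw <- pmin(pmax(z, lo), hi)                   # winsorize at the 5% / 95% quantiles
fitdistr(zw, densfun = "normal")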

The goal: assume we have 100 sets; each element is a vector of $n$ observations, and each observation is an RV $\sim \mathcal{NB}$. I would like to train on these 100 sets and then classify new elements as either outliers from the population or normal.
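
A minimal sketch of the kind of outlier rule I have in mind (the pooled fit and the 0.001 tail cutoff are my own assumptions, not a fixed design):

library(MASS)
train <- rnbinom(1000, size = 5, mu = 20)     # stand-in for the pooled training sets
fit   <- fitdistr(train, densfun = "negative binomial")$estimate
# flag a new value as an outlier if it lands in either extreme tail of the fitted NB
is_outlier <- function(x, alpha = 0.001) {
  lo <- qnbinom(alpha,     size = fit["size"], mu = fit["mu"])
  hi <- qnbinom(1 - alpha, size = fit["size"], mu = fit["mu"])
  x < lo | x > hi
}
is_outlier(c(18, 95))                         # typically FALSE TRUE with these parameters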

German Demidov
  • Before thinking about the data distribution model, think about what is the right link function for your model. A normal-based model (after transformation) often uses the identity link, while an NB-based model often uses the log link. That leads to differing interpretations; see my answer to http://stats.stackexchange.com/questions/142338/goodness-of-fit-and-which-model-to-choose-linear-regression-or-poisson/142353#142353 – kjetil b halvorsen Feb 24 '16 at 15:26
  • That is a really useful comment; I will certainly use it for several problems, and the splitting example is really nice. What if I want to use a simple classification model like Naive Bayes? It just calculates posteriors, so I guess there is no link function... but the estimation of the distributions' parameters is crucial for it. – German Demidov Feb 24 '16 at 15:56
