3

enter image description here

I am trying to fit a distribution to the data so I can use a GLM to predict new data. The problem is that, with data like this, the goodness-of-fit tests reject every distribution I know.


Edit:

Here is the ECDF that was requested.

enter image description here

Mr. Liu
  • You are dealing with some kind of mixture distribution, possibly a zero-inflated distribution, but without knowing anything more about your data it is hard to say anything. – Tim Oct 29 '16 at 20:51
  • You have a very large N; since the underlying distribution probably doesn't match any ideal distribution exactly your large sample size will ensure that tests will reject that distribution. Even if a test rejects it you shouldn't rule it out as being a good model. – Hugh Oct 29 '16 at 20:54
  • @Mr. Liu what type of data is it? Is it a time series? – Erik Hambardzumyan Oct 29 '16 at 20:55
  • No, it is the number of hours to paint the damaged areas of a car, so I guess it's just a continuous variable – Mr. Liu Oct 29 '16 at 21:01
  • @Hugh so what distribution should I use for glm? normal? Or should I try all of them? – Mr. Liu Oct 29 '16 at 21:03
  • @Mr.Liu, are the hours only in whole numbers, or are there numbers like 2.5? – gung - Reinstate Monica Oct 29 '16 at 21:40
  • Yes, there are numbers like 2.5 – Mr. Liu Oct 29 '16 at 22:14
  • @Mr.Liu I would try to model it as a bimodal normal distribution or another type of mixture distribution which you think looks right. – Hugh Oct 29 '16 at 23:50

2 Answers

6

The spike at zero in your plot shows density at negative values. Presumably there are no negative times, which also suggests that the 0-values are possibly all just exact zeros. It's important that you make it explicit whether or not the data set contains exact 0-values! (In fact, if the question hadn't already been answered ignoring this possibility, I would have asked you to clarify before getting answers.)

[A KDE is not suitable for a mixed distribution, because it leads us into erroneously using models that don't correspond to the actual data. For example, look at the suggested mixture density in the other answer -- that thought of a mixture of two continuous distributions was generated by your plot -- indeed I'd have likely thought along the same lines if I hadn't noticed these were times -- but it would lead to a prediction interval that included negative times, just as your plot suggests! Can you show an ECDF? That would avoid suggesting times can be negative.]

If it's the case that the spike near zero is all-zero, you could use (say) zero-inflated GLMs; a gamma distribution for the non-zero portion might work. If you have suitable predictors for the 0-time/non-zero-time process, you could also just model zero versus non-zero values via logistic regression and then model the size of the non-zeros by a GLM or by taking logs and fitting a regression -- essentially a conditional model along the lines of a hurdle model.
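To make the two-part idea concrete, here is a minimal sketch in R on simulated data. The predictor `area`, the coefficients and the sample size are all invented for illustration and have nothing to do with the real data in the question; this is one way to set it up, not the only one.

```r
# Hypothetical simulated data: 'area' is a made-up predictor, not from the question.
set.seed(1)
n      <- 2000
area   <- runif(n, 0, 5)
is_pos <- rbinom(n, 1, plogis(-0.5 + 0.8 * area))   # 1 = non-zero painting time
mu     <- exp(0.2 + 0.25 * area)                    # conditional mean of positive times
hours  <- ifelse(is_pos == 1, rgamma(n, shape = 4, rate = 4 / mu), 0)
dat    <- data.frame(hours = hours, area = area, nonzero = as.integer(hours > 0))

# Part 1: logistic regression for zero versus non-zero
fit_zero <- glm(nonzero ~ area, family = binomial, data = dat)

# Part 2: gamma GLM (log link) for the size of the non-zero times only
fit_pos <- glm(hours ~ area, family = Gamma(link = "log"),
               data = subset(dat, hours > 0))

# Expected hours for new cases = P(non-zero) * E(hours | non-zero)
new            <- data.frame(area = c(1, 3))
p_pos          <- predict(fit_zero, newdata = new, type = "response")
mean_pos       <- predict(fit_pos,  newdata = new, type = "response")
expected_hours <- p_pos * mean_pos
```

Because the gamma part puts no mass at zero, the zero part and the positive part can be fitted separately like this, and each can have its own set of predictors.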

Note also that in general you shouldn't judge what distributions may be reasonable in a glm or regression from the shape of the marginal distribution. Your model is for the conditional distribution; the marginal distribution may be quite different. For example, my suggestion for the times (gamma) was based more on thinking about the likely skewed shape of time - even conditionally on some predictors - for painting damage on cars rather than the shape in the picture.


Edit: the ECDF added to the question shows that there is a substantial collection of exact 0's -- slightly more than 20% of the total. As mentioned earlier in my answer, zero-inflated or hurdle models would be one approach to this.
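A quick way to check this kind of thing yourself is sketched below. The vector `x` is a simulated stand-in (20% exact zeros plus gamma draws), not the real data.

```r
# Hypothetical stand-in for the real 'hours' vector
set.seed(2)
x <- c(rep(0, 200), rgamma(800, shape = 6, rate = 3))

prop_zero <- mean(x == 0)   # share of exact zeros
plot(ecdf(x))               # a jump at 0 shows a point mass rather than a smooth density
plot(density(x[x > 0]))     # shape of the strictly positive part only
```

The size of the vertical jump at 0 in the ECDF plot is the same number as `prop_zero`.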

However, an additional issue may be the large gap to the low end of the continuous part of the distribution. It may be better to consider a shifted model that incorporates a minimum time (conditional on it being positive) of a little under an hour. This doesn't say that there aren't values between 0 and just short of an hour, only that there are so few of them that a model assuming times are either "effectively zero" or "at least $n$ minutes" (where $n$ might be, say, 50 or thereabouts) might be a good approximation.
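One possible way to fit such a shifted model for the positive part is sketched below, again on simulated data. The 50-minute shift, the predictor `area` and the coefficients are assumptions made for the example.

```r
# Hypothetical simulated positive times with an assumed minimum of 50 minutes
set.seed(3)
shift     <- 50 / 60                      # assumed minimum positive time, in hours
n         <- 1000
area      <- runif(n, 0, 5)               # made-up predictor
mu_excess <- exp(-0.3 + 0.3 * area)       # mean of the time above the minimum
pos_hours <- shift + rgamma(n, shape = 3, rate = 3 / mu_excess)
dat       <- data.frame(excess = pos_hours - shift, area = area)

# Gamma GLM for the excess over the minimum time
fit_shift <- glm(excess ~ area, family = Gamma(link = "log"), data = dat)

# Predictions are shifted back onto the original scale
pred <- shift + predict(fit_shift, newdata = data.frame(area = 2), type = "response")
```

With real data, the handful of positive values below the chosen shift would have to be dealt with first (for example, by treating them as effectively zero), since the gamma family needs a strictly positive response.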

Glen_b
  • Why do you say that a KDE can't be used w/ a mixture distribution? – gung - Reinstate Monica Oct 29 '16 at 21:43
  • @gung I didn't say it can't. I said it's not suitable. When used on data with (say) exact 0's, it leads people to mistakenly model data as continuous when it isn't -- which in this case would lead us to produce prediction intervals which include negative times (as the other answer on this page actually suggests!). I think calling displays that lead people into predictions that include positive probability for negative times unsuitable is perfectly apt. – Glen_b Oct 29 '16 at 21:46
  • Hmmm, I guess I see what you mean. I wouldn't have a problem w/ using it myself, b/c I wouldn't be thrown off by that. – gung - Reinstate Monica Oct 29 '16 at 21:49
  • I usually try to avoid plots that might tend to mislead others -- where "others" definitely includes "me in a couple of years time after I have forgotten the details". I might look at a KDE but, I wouldn't use it if I was keeping the plot. In this case the OP is showing a plot to other people (all of us) and failing (in the question) to actually mention that the variable is necessarily non-negative, so this issue of unsuitability definitely applies. A suitable display would avoid the potential for us to suggest models that produce prediction intervals that include negative values. – Glen_b Oct 29 '16 at 22:01
  • the times are not negative: Empirical CDF Call: ecdf(hours) x[1:4470] = 0, 0.015779, 0.26458, ..., 6.8447, 7.3137 – Mr. Liu Oct 29 '16 at 22:26
  • @Mr.Liu I meant "please show an ecdf plot in your question". In R you'd do plot(ecdf(x)). Your comment doesn't show whether the small values are nearly-all 0 or whether there's a substantial proportion of "a little larger than 0" values. An alternative would be to show plot(density(x[x>0])) .... I note that one of your times there appears to be 56.8 seconds. How was that measured to a fraction of a second? (And how does it take 56 seconds to paint anything?) – Glen_b Oct 29 '16 at 22:33
  • @Glen_b sorry, I didn't know how to upload an image in a comment, I have uploaded it here: http://imgur.com/NNh6HsQ – Mr. Liu Oct 29 '16 at 23:00
  • @Mr.Liu Thanks, I have put it in your question so other answerers can see what's going on with the data. It is as I supposed it might be, mostly exact zeros. – Glen_b Oct 29 '16 at 23:11
2

First, please review this post on hypothesis testing with large data sets. With a sample of this size, a goodness-of-fit test is almost certain to reject any distribution.

Are large data sets inappropriate for hypothesis testing?
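As a rough illustration of the sample-size effect (a sketch on simulated data, not the data from the question): logistic draws look close to normal, and a test for normality on them tends to go unnoticed at a small n but is rejected decisively at a large n. The helper `gof_p` is a name made up for this example.

```r
# Hypothetical helper: p-value of a KS test for normality on n logistic draws.
# The mean and sd are estimated from the same sample, so the p-value is only
# approximate (it errs on the conservative side), which is fine for illustration.
gof_p <- function(n) {
  x <- rlogis(n)
  ks.test(x, "pnorm", mean = mean(x), sd = sd(x))$p.value
}

set.seed(4)
p_small <- gof_p(100)   # small sample: little power to see the difference
p_large <- gof_p(1e5)   # large sample: the same small difference is detected
c(small = p_small, large = p_large)
```

The distribution is equally "wrong" in both cases; only the power of the test changes.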

I would agree with Tim's comment that a mixture of Gaussians can be used in this case. Here is an example.

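library(mixtools)  # provides normalmixEM

# Draw n values from a two-component Gaussian mixture:
# component 1 (mean m1, sd s1) with probability alpha, otherwise component 2.
gaussmix <- function(n, m1, m2, s1, s2, alpha) {
  I <- runif(n) < alpha
  rnorm(n, mean = ifelse(I, m1, m2), sd = ifelse(I, s1, s2))
}

s <- gaussmix(10000, 0, 2, 0.05, 0.5, 0.2)

mixmdl <- normalmixEM(s)  # fit a two-component normal mixture by EM
plot(mixmdl, which = 2)   # histogram with the fitted component densities
grid()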

enter image description here

Haitao Du