I am trying to fit a distribution to my data so that I can use a GLM to predict new values. The problem is that, with data like this, every goodness-of-fit test I run rejects each distribution I know.
Edit:
Here is the ECDF that was requested.
The spike at zero in your plot shows density at negative values. Presumably there are no negative times, which suggests that the values near 0 may all be exact zeros. It's important to make explicit whether or not the data set contains exact 0-values! (In fact, if the question hadn't already been answered while ignoring this possibility, I would have asked you to clarify before it got answers.)
[A KDE is not suitable for a mixed distribution, because it leads us into erroneously using models that don't correspond to the actual data. For example, look at the suggested mixture density in the other answer -- that thought of a mixture of two continuous distributions was generated by your plot -- indeed I'd have likely thought along the same lines if I hadn't noticed these were times -- but it would lead to a prediction interval that included negative times, just as your plot suggests! Can you show an ECDF? That would avoid suggesting times can be negative.]
If it's the case that the spike near zero is all-zero, you could use (say) zero-inflated GLMs; a gamma distribution for the non-zero portion might work. If you have suitable predictors for both the zero/non-zero process and the size of the non-zero times, you could also just model zero versus non-zero via logistic regression and then model the size of the non-zeros by a GLM or by taking logs and fitting a regression. This is essentially a conditional model along the lines of a hurdle model.
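As a rough illustration of the two-part (hurdle-style) idea, here is a minimal sketch in base R. Everything in it is an assumption made for the example: the data are simulated, `z` is a hypothetical predictor, and the coefficients are arbitrary. It is not the asker's data or a recommended final model.

```r
set.seed(1)
n <- 2000
z <- rnorm(n)                                  # hypothetical predictor
is_pos <- rbinom(n, 1, plogis(1.4 + 0.5 * z))  # 1 = non-zero time, 0 = exact zero
mu <- exp(0.7 + 0.3 * z)                       # conditional mean of the positive times
shape <- 4
y <- ifelse(is_pos == 1, rgamma(n, shape = shape, rate = shape / mu), 0)
d <- data.frame(y = y, z = z)

# Part 1: logistic regression for P(y > 0 | z)
fit_zero <- glm(I(y > 0) ~ z, family = binomial, data = d)

# Part 2: gamma GLM (log link) fitted to the positive times only
fit_pos <- glm(y ~ z, family = Gamma(link = "log"), data = subset(d, y > 0))

# Unconditional mean: P(y > 0 | z) * E[y | y > 0, z]
newd <- data.frame(z = c(-1, 0, 1))
p_pos <- predict(fit_zero, newd, type = "response")
m_pos <- predict(fit_pos, newd, type = "response")
pred_mean <- p_pos * m_pos
pred_mean
```

The two parts are fitted separately, so each can have its own predictors; a single predictor is shared here only to keep the sketch short.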
Note also that in general you shouldn't judge what distributions may be reasonable in a glm or regression from the shape of the marginal distribution. Your model is for the conditional distribution; the marginal distribution may be quite different. For example, my suggestion for the times (gamma) was based more on thinking about the likely skewed shape of time - even conditionally on some predictors - for painting damage on cars rather than the shape in the picture.
Edit: the ECDF added to the question shows that there is a substantial collection of exact 0's -- slightly more than 20% of the total. As mentioned earlier in my answer, zero-inflated or hurdle models would be one approach to this.
However, an additional issue may be the large gap between zero and the low end of the continuous part of the distribution. It may be better to consider a shifted model that incorporates a minimum time (conditional on it being positive) of a little under an hour. This doesn't say that there are no values between 0 and just short of an hour, only that there are so few of them that a model assuming times are either "effectively zero" or "at least $n$ minutes" (where $n$ might be, say, 50 or thereabouts) might be a good approximation.
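One way to sketch such a shifted model in R is to subtract an assumed minimum from the positive times and fit a gamma model to the excess. The data below are simulated, and the shift of 50 minutes, the 20% share of zeros, and the gamma parameters are all assumptions chosen for illustration. No predictors are included, so this only shows the mechanics.

```r
set.seed(2)
shift <- 50      # assumed minimum positive time, in minutes (hypothetical)
n <- 1000

# Simulated times: about 20% exact zeros, the rest at least `shift` minutes
x <- ifelse(runif(n) < 0.2, 0, shift + rgamma(n, shape = 2, rate = 1 / 30))

pos <- x[x > 0]

# Intercept-only gamma GLM for the excess over the shift
fit <- glm(I(pos - shift) ~ 1, family = Gamma(link = "log"))
mean_excess <- exp(coef(fit)[[1]])   # estimated mean of (time - shift)

c(p_zero = mean(x == 0), mean_positive = shift + mean_excess)
```

In practice the shift would have to be chosen from the data or from subject knowledge, and the excess must stay strictly positive for the gamma family to apply.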
plot(ecdf(x))). Your comment doesn't show whether the small values are nearly-all 0 or whether there's a substantial proportion of "a little larger than 0" values. An alternative would be to show plot(density(x[x>0])) .... I note that one of your times there appears to be 56.8 seconds. How was that measured to a fraction of a second? (And how does it take 56 seconds to paint anything?)
– Glen_b
Oct 29 '16 at 22:33
First, please review this post on hypothesis testing with large data sets. With a sample this large, a goodness-of-fit test is almost certain to reject any candidate distribution.
Are large data sets inappropriate for hypothesis testing?
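To see the effect, here is a small simulated sketch (my own example, not the asker's data). Both samples come from a normal distribution whose mean is off by only 0.02 standard deviations from the hypothesized N(0, 1); the shift, sample sizes, and seed are arbitrary assumptions.

```r
set.seed(42)
shift <- 0.02   # tiny departure from the null, negligible in practice

# Kolmogorov-Smirnov test against a standard normal at two sample sizes
p_small <- ks.test(rnorm(100, mean = shift), "pnorm")$p.value
p_large <- ks.test(rnorm(1e6, mean = shift), "pnorm")$p.value

c(n_100 = p_small, n_1e6 = p_large)
```

With a million observations the p-value should be very close to zero even though the departure is practically irrelevant, while the small sample will usually not reject (that part depends on the draw).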
I would agree with Tim's comment that a mixture of Gaussians can be used in this case. Here is an example.
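# Draw n values from a two-component Gaussian mixture:
# with probability alpha from N(m1, s1^2), otherwise from N(m2, s2^2).
gaussmix <- function(n, m1, m2, s1, s2, alpha) {
  I <- runif(n) < alpha
  rnorm(n, mean = ifelse(I, m1, m2), sd = ifelse(I, s1, s2))
}
set.seed(1)  # make the simulation reproducible
s <- gaussmix(10000, 0, 2, 0.05, 0.5, 0.2)
library(mixtools)
mixmdl <- normalmixEM(s)  # fit a two-component normal mixture by EM
plot(mixmdl, which = 2)   # histogram with the fitted component densities
grid()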