I am trying to fit a mixture model to a dataset of counts (each record is a count of something, such as the number of attempts by an IP address to connect to a website). I know a priori that each point belongs to one of two groups (for example, malicious IPs and legitimate IPs), but I can't tell which one without manually inspecting the record. Here is what I do know:
- As the count value increases, it is more and more likely that the data point belongs to group 2.
- 80% of values are 1 and 85% are less than 2.
- Most values for group 1 are 1, but there are values as high as 10,000 that are from group 1.
Ideally, I'd like to be able to draw a line somewhere and say that if the count is higher than some $n$, then I am $p$% sure that the record is from group 2.
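In symbols, and assuming notation that I am introducing only for illustration (mixing weights $\pi_1, \pi_2$ with $\pi_1 + \pi_2 = 1$, and survival functions $S_k(n) = P(X > n \mid \text{group } k)$), I think the quantity I am after is the posterior probability given by Bayes' rule:

$$
P(\text{group 2} \mid X > n) \;=\; \frac{\pi_2\, S_2(n)}{\pi_1\, S_1(n) + \pi_2\, S_2(n)},
$$

and the cutoff would be the smallest $n$ for which this is at least $p/100$.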
I think a mixture model is a good place to start, but looking at the histogram and kernel density plot, I don't see any distinct clumps in the data. Based on the points above, my guess is that a mixture of a log-logistic (for group 2) and a Pareto (for group 1) might be an OK approximation to the data. Is there an R package or Python module that would estimate the parameters of such a mixture?
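To make the question concrete, here is a rough sketch of the kind of fit I have in mind, not something I know to be the right approach. It assumes SciPy, where `scipy.stats.fisk` is the log-logistic and `scipy.stats.pareto` is the Pareto, and it maximizes the mixture likelihood directly with `scipy.optimize.minimize`. The function names (`fit_mixture`, `prob_group2_above`), the starting values, and fixing the Pareto minimum at 1 are my own assumptions. It also treats the counts as continuous, which is only an approximation for data this heavily concentrated at 1.

```python
import numpy as np
from scipy import optimize, stats


def _unpack(theta):
    """Map unconstrained optimizer values to valid mixture parameters."""
    t = np.clip(theta, -20.0, 20.0)        # keep exp() from overflowing
    w1 = 1.0 / (1.0 + np.exp(-t[0]))       # weight of group 1, in (0, 1)
    b, c, s = np.exp(t[1:4])               # Pareto shape, log-logistic shape and scale
    return float(w1), float(b), float(c), float(s)


def neg_log_lik(theta, x):
    """Negative log-likelihood of the two-component mixture."""
    w1, b, c, s = _unpack(theta)
    f1 = stats.pareto.pdf(x, b, scale=1.0)  # group 1; minimum fixed at 1 (assumption)
    f2 = stats.fisk.pdf(x, c, scale=s)      # group 2; log-logistic
    nll = -np.sum(np.log(w1 * f1 + (1.0 - w1) * f2 + 1e-300))
    return nll if np.isfinite(nll) else 1e12


def fit_mixture(x):
    """Fit the mixture by direct maximum likelihood (may find a local optimum)."""
    x = np.asarray(x, dtype=float)
    # Arbitrary starting point: equal weights, unit shapes, scale near the median.
    theta0 = np.array([0.0, 0.0, 0.0, np.log(max(np.median(x), 1.0))])
    res = optimize.minimize(
        neg_log_lik, theta0, args=(x,), method="Nelder-Mead",
        options={"maxiter": 5000, "xatol": 1e-6, "fatol": 1e-6},
    )
    w1, b, c, s = _unpack(res.x)
    return {"w1": w1, "pareto_shape": b, "fisk_shape": c, "fisk_scale": s}


def prob_group2_above(n, p):
    """P(group 2 | count > n) under fitted parameters p, via Bayes' rule."""
    s1 = p["w1"] * stats.pareto.sf(n, p["pareto_shape"], scale=1.0)
    s2 = (1.0 - p["w1"]) * stats.fisk.sf(n, p["fisk_shape"], scale=p["fisk_scale"])
    return s2 / (s1 + s2)
```

With `params = fit_mixture(counts)`, I would then scan `prob_group2_above(n, params)` over increasing `n` to find the cutoff. I am unsure how well this identifies the two components when there are no visible clumps, and whether the result depends strongly on the starting values, which is part of why I am asking for an established package.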