Identifying a mixture model

Question

I am trying to fit a mixture model to a dataset that consists of counts (so every record is a count of something, like the number of attempts by an IP address to connect to a website). I know, a priori, that each point belongs to one of two groups (for example malicious IPs and legitimate IPs), but I don't know which one without manually inspecting the record and determining the grouping. These are the things that I know though:

As the count value increases, it is more and more likely that the data point belongs to group 2.
80 percent of values are 1 and 85% are less than 2.
Most values for group 1 are 1, but there are values as high as 10,000 that are from group 1.

Ideally, I'd like to be able to draw a line somewhere and say if the count is higher than some $n$, then I am $p$% sure that this record is from group 2.

I think a mixture model is a good place to start for modelling this data, but looking at the histogram and kernel density plot, I don't see any distinct clumps in the data. Based on what I mentioned above, my guess is that a mixture of a log-logistic (for group 2) and a Pareto (for group 1) might be an OK approximation to the data. Is there an R package or Python module that would estimate the parameters of such a mixture?

score 1 · Answer 1 · answered May 26 '14 at 23:19

> Ideally, I'd like to be able to draw a line somewhere and say if the count is higher than some $n$, then I am $p$% sure that this record is from group 2.

It sounds like you want a model that gives an estimate of

$P(Y=1|X=x)$

where $X$ is the variable representing the count, and $Y$ indicates membership of group 2 ($Y=1$ means "is a member").

When you say "a mixture of log-logistic and Pareto", it sounds like you're referring to the distribution of $X$. But in this model you condition on the value of $X$, so its distribution is irrelevant to that calculation. (Strictly speaking, $X$ is a count, so it can't be either log-logistic or Pareto, since those are continuous distributions.)

There are a variety of tools for this job, but you might like to consider logistic regression as a first step. That will not require you to draw a line - it will give you an estimate of the probability at any $x$. If you want to pre-specify $p$, you can back out an estimate of the $x$ that corresponds to that $p$.
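As a rough illustration of that last step, here is a minimal sketch using only the Python standard library. It assumes a labelled sample is available; the data below are simulated stand-ins, not real IP counts. It also assumes $\log x$ as the predictor, which is a modelling choice for this sketch, not something the question dictates. The helper names `fit_logistic` and `threshold_for` are made up for the example. With fitted coefficients $b_0, b_1$, the count at which the estimated probability equals $p$ is $x = \exp\{(\log\frac{p}{1-p} - b_0)/b_1\}$.

```python
import math
import random


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def fit_logistic(counts, labels, iters=50, tol=1e-10):
    """Fit P(Y=1 | X=x) = sigmoid(b0 + b1*log(x)) by Newton-Raphson."""
    zs = [math.log(x) for x in counts]
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for z, y in zip(zs, labels):
            p = sigmoid(b0 + b1 * z)
            w = p * (1.0 - p)
            g0 += y - p            # gradient of the log-likelihood
            g1 += (y - p) * z
            h00 += w               # entries of the information matrix
            h01 += w * z
            h11 += w * z * z
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det   # Newton step: H^{-1} g
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


def threshold_for(p, b0, b1):
    """Count x at which the fitted P(Y=1 | X=x) equals p (needs b1 != 0)."""
    return math.exp((math.log(p / (1.0 - p)) - b0) / b1)


# Simulated stand-in for a hand-labelled sample (hypothetical, not real data):
# heavy-tailed counts >= 1, labels drawn from a known logistic model.
random.seed(0)
counts = [int(random.paretovariate(2.3)) for _ in range(5000)]
labels = [1 if random.random() < sigmoid(-3.0 + 1.5 * math.log(x)) else 0
          for x in counts]

b0, b1 = fit_logistic(counts, labels)
n90 = threshold_for(0.90, b0, b1)
print(f"b0={b0:.2f}, b1={b1:.2f}, count where P(group 2)=0.90: {n90:.1f}")
```

In practice you would probably use an existing routine (for example `glm` with a binomial family in R) instead of hand-rolled Newton steps; the point here is only the inversion from $p$ back to $x$.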

> Is there an R package or python module that would estimate the parameters of such a mixture?

This is such a different question as to merit its own post. In the first question you're modelling $Y|X=x$. Now it seems you want to deal with $X|Y=y$ (where $y$ takes the values 0 and 1). Have I understood? You want a model that conditions in the other direction?
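For reference, the two directions are linked by Bayes' rule. The notation here is introduced only for this illustration: $\pi = P(Y=1)$ is the mixing proportion and $f_y(x) = P(X=x|Y=y)$ are the two component distributions. Then

$P(Y=1|X=x) = \dfrac{\pi f_1(x)}{\pi f_1(x) + (1-\pi) f_0(x)}$

and the denominator, $P(X=x) = \pi f_1(x) + (1-\pi) f_0(x)$, is the mixture distribution of the observed counts. So a fitted mixture would imply the classification probability, but the observed counts alone only show you that denominator, not its separate pieces.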

Glen_b

Thanks @Glen_b. My problem is that $Y$ is a latent variable. If I had the value of $Y$ for every record, then logistic regression would be a great way to start. But if I had $Y$, I would have no problem at all, since I would know exactly which IPs are malicious. – user765195 May 27 '14 at 00:44
Either you have some data with group membership on which to construct a classification rule or you don't, and your question clearly suggests that you do. If you have no data on group membership, how can the first question I quoted be answered at all? If you have some such data, what prevents applying logistic regression to that data? – Glen_b May 27 '14 at 00:53
I can inspect each record and dig into IP lists to see whether it is malicious or not (if it is listed somewhere), but my data does not include the group for each record. The dataset has about 500K records. – user765195 May 27 '14 at 00:57
I should also mention that I'm looking for $X|Y=y$ as a mixture. – user765195 May 27 '14 at 00:58
In respect of the first question -- to train your classification model (i.e. estimate parameters), you need a training sample covering a range of $X$, where you know the groups. To answer the second question you need the same thing. If that sample is large enough, you can estimate the conditional distribution of $X$, and the unconditional (i.e. mixture distribution) without assuming a specific distributional model, but the smaller your samples the more you'll likely need to bring to the table in terms of assumptions. – Glen_b May 27 '14 at 01:03
Beware, however: if you classify the rest of your data into the two groups based on a small sample of it, and then use that classified data to try to flip the conditioning around to get your mixture, it won't contain any more information than the original sample you trained on. – Glen_b May 27 '14 at 01:05
I could be wrong, but I think sometimes you can do without labels. If my data consisted of clear clusters (say two normal-looking clumps), I could probably use the a priori knowledge that I listed above and identify the mixture model. I'm not sure, but I think Latent Profile Analysis does something of this nature. – user765195 May 27 '14 at 02:30
Yes, if you bring a priori knowledge that clusters will be unimodal, and the situation is one where there can only be two clusters, and you observe a bimodal distribution, the problem is well-formed and has a solution. (But if that a priori knowledge comes from a sample from the population of interest, you should use that sample to do your inference, so you make efficient use of the actual information.) – Glen_b May 27 '14 at 02:37
