
Is it possible to take a dataset with mixed variables (continuous, ordered, and categorical) and cluster it using a Gaussian Mixture Model fit with the EM algorithm? I cannot find anywhere whether this is possible. Apologies if this seems like an obvious question, but I cannot seem to find the answer online.

Thanks

ttnphns
Martin
  • In practice, the finite mixture model can also work with variables of mixed distributions. ref: https://www.sciencedirect.com/science/article/pii/S019126152300036X#:~:text=Lastly%20and%20importantly%2C%20whereas%20continuous,be%20covered%20in%20Section%204.1). – preota Mar 12 '24 at 11:46

3 Answers


The most straightforward technique you can try is one-hot encoding, which converts your discrete features into numeric ones. However, be aware that this increases your dimensionality, which can make it harder to get good performance. It is also not entirely appropriate, since a Gaussian is a density over continuous variables.
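A minimal sketch of this one-hot approach on synthetic data, using a manual one-hot encoding and scikit-learn's `GaussianMixture` (all data and parameter choices below are illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
n = 200

# One continuous feature drawn from two well-separated clusters
x = np.concatenate([rng.normal(0, 1, n), rng.normal(5, 1, n)])

# One categorical feature with 3 levels, encoded as 0/1/2
c = rng.integers(0, 3, 2 * n)

# Manual one-hot encoding: row i becomes the c[i]-th row of the identity
onehot = np.eye(3)[c]

# Stack the continuous column with the one-hot columns
X = np.column_stack([x, onehot])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)
```

Note that the one-hot columns are linearly dependent (they sum to 1), so the covariance estimate relies on the small regularization `GaussianMixture` adds by default (`reg_covar`).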

If these discrete features are truly important, then I agree with @hxd1011 that you'll need to represent those features separately from the continuous ones, then combine them in the joint.

One way to do this is to consider a "blocking" scheme, where you split your data into groups, one for every combination of discrete variables. For instance, if you have only two binary variables $A,B$ and the rest of the continuous features are in $X$, then you can split your data into 4 groups: $P(X|A=1,B=1), P(X|A=1,B=0), P(X|A=0,B=1), P(X|A=0,B=0)$. Of course, you'll still need to model the distribution of $A,B$ however you see fit. Afterwards, you can combine them to form the joint:

$$ P(X,A,B) = P(X|A,B)P(A,B) $$

This way, you can model each of the four groups with a GMM if you want, and the categorical features with some other discrete distribution. Note that this requires a sufficient number of points in each group.

Booley
  • The "blocking" scheme will also blow up, just like one-hot encoding, if the number of discrete combinations is large. – hafiz031 Mar 27 '22 at 14:00

A Gaussian is defined over a continuous variable. If we have more than one continuous feature, we can model the data as an N-dimensional Gaussian.

For data that mixes continuous and discrete features, we still need a way to describe the joint distribution. For example, suppose we have two features $X$ and $Y$, with $X\sim \mathcal{N}(\mu,\,\sigma^{2})$ and $Y \sim \text{Bernoulli}(p)$; we still need a way to define the joint $P(X,Y)$.
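For instance, once we assume a form for the dependence, say $P(X,Y)=P(X\mid Y)\,P(Y)$ with a class-conditional Gaussian, the joint density can be written down directly. A toy sketch (the means and $p$ below are made up for illustration):

```python
import math

# Assumed model: P(y) is Bernoulli(p); p(x | y) is Gaussian with a
# mean that depends on y. These parameter values are illustrative.
p = 0.3
mu = {0: 0.0, 1: 4.0}
sigma = 1.0

def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def joint(x, y):
    """p(x, y) = p(x | y) * P(y) for a Gaussian x and Bernoulli y."""
    py = p if y == 1 else 1 - p
    return normal_pdf(x, mu[y], sigma) * py
```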

So the answer to your question is that we can use a directed graphical model on mixed data, but the Gaussian part still applies only to the continuous variables.

You can find examples here.

https://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

Haitao Du

Just to expand on the directed graphical model point. If you can assume conditional independence of your continuous features $X$ from your categorical features $Y$ given your latent class $Z$, your likelihood will factorize as $$P(X, Y)=\sum_{z} P(X, Y, Z=z)=\sum_{z} P(X|Z=z)P(Y|Z=z)P(Z=z)$$ meaning you can now freely specify the type of each conditional, e.g. $X|Z=z \sim \mathcal{N}(\mu_z, \Sigma_z)$ and $Y|Z=z \sim \text{Bernoulli}(p_z)$ or $Y|Z=z \sim \text{Multinoulli}(p^z_1, p^z_2, ..., p^z_K)$. This is the approach we used in the StepMix Python and R package.
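As an illustration of this factorized likelihood, here is a minimal hand-rolled EM sketch in NumPy for one Gaussian feature and one Bernoulli feature given a two-class latent $Z$ (this is not the StepMix API; the synthetic data and all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic mixed data from two latent classes
n = 500
z = rng.integers(0, 2, n)
x = rng.normal(np.where(z == 0, -2.0, 2.0), 1.0)                 # continuous
y = (rng.random(n) < np.where(z == 0, 0.2, 0.8)).astype(float)   # binary

# Parameters: class weights pi, Gaussian means mu and variances var,
# Bernoulli probabilities p (one per latent class)
pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])
p = np.array([0.3, 0.7])

for _ in range(50):
    # E-step: responsibilities r[i, k] proportional to
    # pi_k * N(x_i | mu_k, var_k) * Bern(y_i | p_k)
    gauss = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    bern = p ** y[:, None] * (1 - p) ** (1 - y[:, None])
    r = pi * gauss * bern
    r /= r.sum(axis=1, keepdims=True)

    # M-step: weighted maximum-likelihood updates per class
    nk = r.sum(axis=0)
    pi = nk / n
    mu = (r * x[:, None]).sum(axis=0) / nk
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    p = (r * y[:, None]).sum(axis=0) / nk
```

The conditional-independence assumption is what lets the E-step multiply a Gaussian density by a Bernoulli mass function instead of needing a joint density over mixed types.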

Chu24