Is it possible to introduce cluster probabilities into a regression?

Consider the Old Faithful Geyser data set. Most clustering algorithms find 2 clusters when analysing eruption times and waiting times. We can fit a regression to the data in each cluster to estimate waiting time as a function of eruption time. However, with probabilistic clustering (e.g. Gaussian mixture models) one can also obtain an estimate of the probability that an observation belongs to each cluster.

How can we incorporate these probabilities into the regressions? If the probabilistic clustering model predicts that an observation has a 40% probability of being in cluster A, then I would like to use this information, rather than discard it and only consider the observation in the other cluster.

I have thought about two approaches: sampling observations according to their probability weights, and weighting the observed values by their probabilities. However, I think the latter will just dampen the regression line towards 0. Any suggestions or advice would be welcome.

29703461

1 Answer

You can simultaneously estimate the cluster probabilities and regression coefficients using a mixture model such as $$Y_i \mid Z_i=z \sim N(\mu_{iz},\sigma_z^2),$$ where $Y_i$ is the waiting time associated with eruption $i$, $Z_i \in \{1,2\}$ is the latent cluster of eruption $i$, and $P(Z_i=1)=p$. Here $$\mu_{iz}=\beta_{z0} + \beta_{z1} x_i,$$ where $x_i$ is the duration of eruption $i$. You will likely get more efficient parameter estimates if you can reasonably assume that $\sigma^2_1=\sigma^2_2$, $\beta_{10}=\beta_{20}$, and/or $\beta_{11}=\beta_{21}$ (though not all three at once, since the two clusters would then be indistinguishable).
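To make the link to cluster probabilities explicit, it may help to write out what this model implies. Writing $\phi(y;m,s^2)$ for the normal density with mean $m$ and variance $s^2$ (notation introduced here, not part of the model statement above), the marginal density of a waiting time is $$f(y_i \mid x_i) = p\,\phi(y_i;\beta_{10}+\beta_{11}x_i,\sigma_1^2) + (1-p)\,\phi(y_i;\beta_{20}+\beta_{21}x_i,\sigma_2^2),$$ and the probability that eruption $i$ belongs to cluster 1, given its data, is $$P(Z_i=1 \mid y_i,x_i) = \frac{p\,\phi(y_i;\beta_{10}+\beta_{11}x_i,\sigma_1^2)}{p\,\phi(y_i;\beta_{10}+\beta_{11}x_i,\sigma_1^2) + (1-p)\,\phi(y_i;\beta_{20}+\beta_{21}x_i,\sigma_2^2)}.$$ So the cluster probabilities depend on the regression coefficients, and vice versa, which is why the two are estimated together.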

You cannot first estimate the cluster probabilities using this dataset and then treat them as known in a subsequent analysis of the relationship between waiting time and duration in each cluster. Doing so uses the same data twice and ignores the uncertainty in the estimated probabilities.
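One common way to fit a joint model of this kind is the EM algorithm. The following is a minimal sketch, not a definitive implementation: it uses only NumPy, runs on synthetic data with made-up parameters (not the real Old Faithful measurements), and the function name `fit_mixture_of_regressions` and the median-split initialisation are illustrative assumptions.

```python
import numpy as np


def normal_pdf(y, mean, var):
    """Density of N(mean, var) evaluated at y."""
    return np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)


def fit_mixture_of_regressions(x, y, n_iter=500, tol=1e-8):
    """EM for a two-component mixture of simple linear regressions (sketch)."""
    n = len(y)
    X = np.column_stack([np.ones(n), x])  # design matrix: intercept, slope

    # Assumed initialisation: soft split of the observations at the median of x.
    r1 = np.where(x < np.median(x), 0.9, 0.1)
    resp = np.column_stack([r1, 1.0 - r1])  # resp[i, z] ~ P(Z_i = z + 1 | data)

    beta = np.zeros((2, 2))  # beta[z] = (intercept, slope) for cluster z + 1
    var = np.ones(2)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # M-step: weighted least squares per cluster, weights = responsibilities.
        for z in range(2):
            w = resp[:, z]
            XtW = X.T * w
            beta[z] = np.linalg.solve(XtW @ X, XtW @ y)
            resid = y - X @ beta[z]
            var[z] = np.sum(w * resid ** 2) / np.sum(w)
        p = resp[:, 0].mean()  # estimate of P(Z_i = 1)

        # E-step: recompute the posterior cluster probabilities.
        dens = np.column_stack([
            p * normal_pdf(y, X @ beta[0], var[0]),
            (1.0 - p) * normal_pdf(y, X @ beta[1], var[1]),
        ])
        total = dens.sum(axis=1)
        resp = dens / total[:, None]

        ll = np.log(total).sum()  # observed-data log-likelihood
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return beta, var, p, resp


# Synthetic example (all numbers below are invented for illustration).
rng = np.random.default_rng(0)
n = 300
short = rng.random(n) < 0.35
x = np.where(short, rng.normal(2.0, 0.3, n), rng.normal(4.3, 0.4, n))
y = np.where(short, 40.0 + 7.0 * x, 65.0 + 4.0 * x) + rng.normal(0.0, 3.0, n)

beta, var, p, resp = fit_mixture_of_regressions(x, y)
print("coefficients per cluster:\n", beta)
print("residual variances:", var, " P(Z=1):", p)
```

Note that the M-step weights each observation's contribution to the least-squares fit by its cluster probability; it does not multiply the observed waiting times by those probabilities, so the fitted lines are not pulled towards 0.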