
Suppose I have data and I want to fit a two-component Gaussian mixture to it. I don't know how to do this in Python, but worse than that, I have an additional constraint: the mean of one component should be less than zero, and the mean of the other component should be greater than or equal to zero.

Does anyone have any experience or python code for doing this type of thing?

mlofton

2 Answers


Here's an alternative method that gives you more control. The idea is to use scipy's minimize plus autograd to do the heavy lifting for you.

  1. We define our negative log-likelihood, which we wish to minimize. Note that I'm going to optimize the scale parameters in log-space, as this is much easier. (Take the exp to transform back.)

  2. We generate some fake data.

  3. We set up the minimize code. Note the bounds argument. For each parameter, [mu1, log_sigma1, mu2, log_sigma2, w], we have a corresponding bound. A None represents unbounded in that direction.

  4. Printing results.x gives us our estimates.

from autograd.scipy.stats import norm
from autograd import numpy as np
from autograd import value_and_grad
from scipy.optimize import minimize
from scipy.stats import norm as norm_

def negative_log_likelihood(params, obs):
    mu1, log_sigma1, mu2, log_sigma2, w = params
    sigma1 = np.exp(log_sigma1)
    sigma2 = np.exp(log_sigma2)
    return -np.log(
        w * norm.pdf(obs, mu1, sigma1) + (1 - w) * norm.pdf(obs, mu2, sigma2)
    ).sum()

Generate some fake data; success will be if we recover these parameters.

obs = np.r_[
    norm_(loc=2, scale=1).rvs(200),
    norm_(loc=-0.75, scale=0.5).rvs(500),
]

results = minimize(
    value_and_grad(negative_log_likelihood),  # see autograd docs.
    x0=np.array([1, 0, -1, 0, 0.5]),  # initial value
    args=(obs,),
    jac=True,
    bounds=(
        (0, None),     # mu1 (you mentioned the constraints on the means)
        (None, None),  # log_sigma1 is unbounded
        (None, 0),     # mu2 (you mentioned the constraints on the means)
        (None, None),  # log_sigma2 is unbounded
        (0, 1),        # the weight param should be between 0 and 1
    ),
)

print(results.x)

Output:

[ 2.03682793 -0.01359715 -0.7450283  -0.67540341  0.28439886]
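Since the scale parameters were optimized in log-space (step 1), remember to exponentiate them when reading off the estimates. A minimal sketch of unpacking results.x, assuming the same parameter ordering as the bounds above:

# Unpack the fitted parameter vector and undo the log transform on the scales.
mu1, log_sigma1, mu2, log_sigma2, w = results.x
sigma1, sigma2 = np.exp(log_sigma1), np.exp(log_sigma2)
print(f"component 1: mean={mu1:.3f}, sd={sigma1:.3f}, weight={w:.3f}")
print(f"component 2: mean={mu2:.3f}, sd={sigma2:.3f}, weight={1 - w:.3f}")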
  • Wow, this looks great. I will check off your answer. I like the other suggestion too, but I prefer algorithms where one is building things from scratch as opposed to using something that's more of a black box. Thank you very much for the Python code also. – mlofton Feb 08 '20 at 14:21

Take a look at the sklearn implementation of Gaussian mixture modeling. Assuming it uses something like EM, having component means that span the origin shouldn't be an issue.
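For the unconstrained case, a minimal sketch of what this looks like, using illustrative data similar to the other answer (sklearn expects a 2-D array, hence the reshape):

import numpy as np
from sklearn.mixture import GaussianMixture

# Illustrative data: a mixture of two Gaussians straddling the origin.
rng = np.random.default_rng(0)
obs = np.r_[rng.normal(2, 1, 200), rng.normal(-0.75, 0.5, 500)]

gm = GaussianMixture(n_components=2, random_state=0).fit(obs.reshape(-1, 1))
print(gm.means_.ravel())   # fitted component means
print(gm.weights_)         # fitted mixture weights

Note, though, that GaussianMixture has no built-in way to force the means to have opposite signs, which is the constraint raised in the comments below.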

an1lam
  • Thank you. I'll have a look; it's much appreciated. But I need that constraint, so the means have to have different signs. This sounds to me like it makes things much more complicated? – mlofton Feb 08 '20 at 14:19
  • Oh I see, it's not that you think the means will get fit with different signs, it's that you need to force that to happen? – an1lam Feb 08 '20 at 15:51
  • Yes, it's a constraint on the problem, but the Python code of @Cam Davidson Pilon handles that through the use of the bounds argument. Thanks for your help. – mlofton Feb 09 '20 at 20:11