4

I'm trying to determine the "special" population out of n populations.

I have measurements with n labels.

label          measurement
a              1.4
b              1.2
a              1.2
c              1.3
... 
  • Measurements with n-1 of the labels are drawn iid from the same normal distribution with mean $\mu_{default}$ and variance $\sigma^2_{default}$.
  • One label is "special" and its measurements are drawn from a different normal distribution with mean $\mu_{special}$ and variance $\sigma^2_{special}$.
  • $\mu_{special}$ > $\mu_{default}$, but I know nothing a priori about the variances. (In practice $\sigma^2_{special}$ is probably close to $\sigma^2_{default}$, but I'd prefer not to make that assumption if possible.)

My goal is to find probabilities that each label is special; that is, if the labels are a, b, and c, I can compute

p(a is special | data)
p(b is special | data)
p(c is special | data) 

such that all probabilities add to 1.

I can run Welch's t-test for each label against the rest of the data to find p-values for each label – but these aren't likelihoods! This gives me good results for determining by inspection which label is special, but I don't know of a justifiable way to generate the probabilities I want above.


The above may be a bit confusing due to bad terminology, so here's an example that I hope helps motivate the problem:

Consider a game with a 6-sided die. The die is hidden from the player. Every turn, the die is rolled.

  • If the die rolls on value 1 (the "special" value), a random measurement m is taken from a normal distribution with mean $\mu_{special}$ and variance $\sigma^2_{special}$. (1, m) is reported to the player.
  • If the die rolls on any other value $v \in [2,6]$, a random measurement m is taken from a normal distribution with mean $\mu_{default}$ and variance $\sigma^2_{default}$, and (v, m) is reported.

The player can see all of the (die outcome, measurement) pairs but does not know anything about the underlying distributions other than that they are normal and that $\mu_{special} > \mu_{default}$.

The player's goal is to derive the probabilities that each value is special; that is, derive p(1 is special), p(2 is special), ..., p(6 is special) given a long list of (die outcome, measurement) pairs. The player has a uniform prior (the special value is chosen uniformly at random before the game starts).
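To make the data-generating process concrete, here is a minimal simulation of the game described above. This is an editorial sketch, not code from the question; the function name `play_game` and the default parameter values are made up for illustration:

```python
import random

def play_game(n_turns, mu_default=0.0, sd_default=1.0,
              mu_special=1.0, sd_special=1.0, special=1, seed=0):
    """Simulate the hidden-die game.

    Returns a list of (die outcome, measurement) pairs: the measurement
    comes from the "special" normal distribution when the die shows the
    special value, and from the "default" one otherwise.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_turns):
        v = rng.randint(1, 6)
        if v == special:
            m = rng.gauss(mu_special, sd_special)
        else:
            m = rng.gauss(mu_default, sd_default)
        pairs.append((v, m))
    return pairs
```

The player only ever sees the returned pairs; the arguments `mu_special`, `sd_special`, and `special` are hidden state.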

  • 1
    Your use of the terms "sample," "population," and "distribution" does not seem to make the standard distinctions, making your account rather confusing. Are you perhaps equating "population" with "distribution"? If not, what is the distinction? Then, although your question is Bayesian and requires a prior distribution and application of Bayes' Theorem to address, why then are you introducing t-testing? – whuber Feb 01 '24 at 20:15
  • It's been a long time since I did proper statistics, but I'll define what I mean here: Sample: a single measurement taken. Population: the set of data from which that measurement was taken. Distribution: the parameters that define the population. t-testing was my first guess about how to derive a likelihood – I could sum log-likelihoods or some such but I'm not sure doing so would be valid with populations of different sizes. – vroomfondel Feb 01 '24 at 21:20
  • To further clarify my distinction between population and distribution: I have data that is labeled (a, b, c, d, e, f, ...). I know that all but one can be described by a normal distribution with "default" parameters as defined in the question, and one can be described by a normal distribution with "special" parameters. So, population === "the total set of data under some label," and distribution is "the distribution that describes n-1 and 1 of these populations." I've edited the question to replace "population" with "label" and "sample" with "measurement." – vroomfondel Feb 01 '24 at 21:26
  • 1
    Thank you. Good keywords to use in research include mixture distribution and contaminated distribution. But the reference to a t-test remains confusing, because it suggests at least two different valid interpretations of your question. – whuber Feb 01 '24 at 22:26
  • The word you are looking for is "outlier" and lots has been written about outlier detection. Searching for that keyword should help you find what you need. – Harvey Motulsky Feb 01 '24 at 21:43
  • I don't think I'm quite looking for outlier detection. What I'm looking for is something closer to this question: stats.stackexchange.com/q/490853/406016 except in my case, the actual distribution parameters are unknown. – vroomfondel Feb 01 '24 at 22:12
  • "but I don't know of a justifiable way to generate the probabilities I want above." This can only be done with a Bayesian approach. You will need to use some prior probability distribution for the special population, but you also need an assumption for the distribution of the nuisance parameters $\mu$ and $\sigma$. – Sextus Empiricus Feb 04 '24 at 21:14
  • @SextusEmpiricus, I am curious what you think of my non-Bayesian maximum likelihood approach. – Matt F. Feb 06 '24 at 02:55
  • @MattF. it is interesting to use the formula $$P\big[\text{label }i\text{ is special}\ |\ data\big]=p_i/\sum_j p_j$$ but I am not sure it is right. I believe one should be using some integration over nuisance parameters. Also using the estimates $L/N$ and $Q/N-L^2/N^2$ does not take into account the condition in the model that $\mu_{special}>\mu_{default}$ – Sextus Empiricus Feb 06 '24 at 06:32
  • Yep, assuming this is correct it’s a super elegant answer but as @SextusEmpiricus said, the only thing it’s missing is the condition. I’ve been playing with this a bit and in practice applying the condition helps a lot (when messing with a non-analytical approach) since the distributions can be close. – vroomfondel Feb 06 '24 at 13:31

3 Answers

5

Here is a maximum likelihood approach.

For each label $i$, let $N_i$ be the number of observations with that label, let $L_i$ (the linear term) be the sum of those observations, and let $Q_i$ (the quadratic term) be the sum of their squares. Let $N$, $L$ and $Q$ be the sums over all $i$.

Let $p_i$ be the maximum likelihood of the data if label $i$ is special. Under this assumption, the measurements with label $i$ are normally distributed with MLE mean $L_i/N_i$ and MLE variance $Q_i/N_i-L_i^2/N_i^2$; the maximized likelihood of the data for label $i$ turns out to be $$f(N_i,L_i,Q_i)=\left(2\pi e\Big(\frac{Q_i}{N_i}-\frac{L_i^2}{N_i^2}\Big)\right)^{-N_i/2}$$ and $$p_i=f(N_i,L_i,Q_i)\,f(N-N_i,L-L_i,Q-Q_i)$$

Since one of the labels must be special, this would yield $$P\big[\text{label }i\text{ is special}\ |\ data\big]=p_i/\sum_j p_j$$
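A minimal sketch of this computation in Python (the helper names are mine, not from the answer; it works with log-likelihoods and subtracts the maximum before exponentiating, for numerical stability):

```python
import numpy as np

def log_f(N, L, Q):
    """log f(N, L, Q) = -(N/2) * log(2*pi*e*(Q/N - (L/N)**2))."""
    var = Q / N - (L / N) ** 2
    return -0.5 * N * np.log(2 * np.pi * np.e * var)

def special_probs(data):
    """data: dict mapping label -> array of measurements.
    Returns dict mapping label -> P[label is special | data]."""
    # per-label count, sum, and sum of squares
    stats = {k: (len(v), np.sum(v), np.sum(np.square(v))) for k, v in data.items()}
    N = sum(s[0] for s in stats.values())
    L = sum(s[1] for s in stats.values())
    Q = sum(s[2] for s in stats.values())
    # log p_i = log f(N_i, L_i, Q_i) + log f(N - N_i, L - L_i, Q - Q_i)
    logp = {k: log_f(*s) + log_f(N - s[0], L - s[1], Q - s[2])
            for k, s in stats.items()}
    m = max(logp.values())
    w = {k: np.exp(v - m) for k, v in logp.items()}
    Z = sum(w.values())
    return {k: v / Z for k, v in w.items()}
```

With labels a, b, c the returned values play the roles of p(a is special | data), p(b is special | data), p(c is special | data), and they sum to 1 by construction.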

Matt F.
  • Once I justify this to myself it's exactly what I'm looking for. As an added bonus it looks like it would be trivial to adapt this to online updates – vroomfondel Feb 05 '24 at 03:45
  • 1
    The power $n/2$ can be absorbed into the exponent. $$\left(e^{x/n}\right)^n = e^x$$ – Sextus Empiricus Feb 06 '24 at 09:58
  • Possibly using the sum of squared residuals may give a nicer expression for the likelihood. – Sextus Empiricus Feb 06 '24 at 09:59
  • 1
    @SextusEmpiricus, it’s not getting any nicer! The base in that expression is $2\pi e$ times a difference, not $2\pi$ times an exponential; and if you change notation to make $Q_i$ a sum of squared residuals for the special label, that makes it awkward to write down the probability of the other labels in the second factor of $p_i$. – Matt F. Feb 06 '24 at 12:03
  • @MattF. It is confusing how you derived that expression. – Sextus Empiricus Feb 06 '24 at 13:09
  • 1
    It comes from a straightforward calculation of the likelihood using the MLE mean and variance, then simplifying using $L_i$ and $Q_i$, and then cancelling — I’d welcome anyone editing the post to write out the intermediate steps, but I’m happy just seeing the conclusion. – Matt F. Feb 06 '24 at 21:58
  • @MattF. without the exponential it seems like you have been using the log likelihood instead of likelihood and it shouldn't have the power. – Sextus Empiricus Feb 07 '24 at 05:57
  • I don’t think it’s helpful for me to defend my algebra further, so I’ll stop engaging on this question. – Matt F. Feb 07 '24 at 09:10
  • 1
    @MattF. you are right, the exponential disappears from the expression. The likelihood is related to the residual sum of squares in the following way. The full equation for the likelihood

    $$\mathcal{L}(\hat{\mu},\hat{\sigma}) = \prod_{i = 1}^n \frac{1}{\sqrt{2\pi\hat\sigma^2}}\exp\left(-\frac{(x_i-\hat\mu)^2}{2\hat\sigma^2}\right) = \hat\sigma^{-n} \left(\frac{1}{2\pi}\right)^{n/2} \exp\left(- \frac{1}{2\hat{\sigma}^2} \sum_{i=1}^n (x_i - \hat\mu)^2\right)$$ ...

    – Sextus Empiricus Feb 07 '24 at 15:27
  • 2
    ... and substituting maximum likelihood estimates $\hat{\mu} = \frac{1}{n}\sum_{i=1}^n x_i$, $RSS = \sum_{i=1}^n (x_i - \hat{\mu})^2$ and $\hat{\sigma}^2 = RSS/n$ you get

    $$\mathcal{L}(\hat{\mu}_{mle},\hat{\sigma}_{mle}) = \left(2\pi e\right)^{-n/2} \left(\frac{RSS}{n}\right)^{-n/2} $$

    – Sextus Empiricus Feb 07 '24 at 15:27
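The closed form discussed in the comments above can be checked numerically. The following is an editorial Python sketch with arbitrary simulated data (not from the thread): it compares the log-likelihood computed directly from normal densities at the MLE against $-\frac{n}{2}\log(2\pi e \cdot RSS/n)$:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(3.0, 2.0, size=40)   # arbitrary sample
n = len(x)
mu_hat = x.mean()                    # MLE of the mean
rss = np.sum((x - mu_hat) ** 2)      # residual sum of squares
sigma2_hat = rss / n                 # MLE of the variance

# direct log-likelihood: sum of log normal densities at (mu_hat, sigma2_hat)
loglik_direct = np.sum(-0.5 * np.log(2 * np.pi * sigma2_hat)
                       - (x - mu_hat) ** 2 / (2 * sigma2_hat))

# closed form: log L = -(n/2) * log(2*pi*e*RSS/n)
loglik_closed = -0.5 * n * np.log(2 * np.pi * np.e * rss / n)
```

The two quantities agree to floating-point precision, which confirms the $(2\pi e)^{-n/2}(RSS/n)^{-n/2}$ form.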
4

I have no direct solution to the problems below, but I believe they are important to mention as a warning: a straightforward answer is not as simple as it looks, and the other answers should be nuanced in that they do not give an exact solution.


It is possible to fit different models based on different hypotheses and compute the likelihood at the fitted point. But a tricky part of the question is that the true variance might differ from the fitted variance, and the problem becomes a variant of the Behrens–Fisher problem.

In addition, one may compare likelihoods, but it is not guaranteed that the true model will have the highest likelihood most often.

An example computation is below, in which we have 5 populations and, for small effect sizes, the special population is recognized less than 20% of the time.

(The computations below use different group sizes to make the effect clearer, and we tested the hypothesis $\mu_{special} \neq \mu_{default}$ instead of the more difficult $\mu_{special} > \mu_{default}$.)

[Plot: example of low recognition power for MLE likelihood comparison]

set.seed(1)

power = function(effect = 0, n = 500) {
  count = 0  ### counter to keep track of right decisions

  ### simulate n times
  for (i in 1:n) {

    ### simulate a sample
    x1 = rnorm(5)
    x2 = rnorm(5)
    x3 = rnorm(5)
    x4 = rnorm(5)
    y  = rnorm(50, effect, 1)

    ### compute log-likelihood for different models
    l1 = logLik(lm(c(x1,x2,x3,x4) ~ 1)) + logLik(lm(c(y) ~ 1))
    l2 = logLik(lm(c(x2,x3,x4,y) ~ 1)) + logLik(lm(c(x1) ~ 1))
    l3 = logLik(lm(c(x1,x3,x4,y) ~ 1)) + logLik(lm(c(x2) ~ 1))
    l4 = logLik(lm(c(x1,x2,x4,y) ~ 1)) + logLik(lm(c(x3) ~ 1))
    l5 = logLik(lm(c(x1,x2,x3,y) ~ 1)) + logLik(lm(c(x4) ~ 1))

    ### convert to (relative) likelihood
    p1 = exp(l1 - min(l1,l2,l3,l4,l5))
    p2 = exp(l2 - min(l1,l2,l3,l4,l5))
    p3 = exp(l3 - min(l1,l2,l3,l4,l5))
    p4 = exp(l4 - min(l1,l2,l3,l4,l5))
    p5 = exp(l5 - min(l1,l2,l3,l4,l5))

    ### keep track of when the true group has the highest likelihood
    if (p1 > max(p2,p3,p4,p5)) { count = count + 1 }
  }
  return(count/n)
}

### create a plot
plot(c(-1,2), c(0.2,0.2), xlim = c(0,0.5), xlab = "effect size",
     ylim = c(0,1), ylab = "fraction recognized", type = "l", lty = 2)

### add points for probability of correct decisions as function of effect size
for (effect in seq(0,0.5,0.05)) { points(effect, power(effect)) }

This relates to the discrepancy between frequentist and Bayesian analyses. A Bayesian analysis gives probabilities conditional on the observations, not conditional on the hypothesis. That is why we can get low (<20%) values when we condition on the hypothesis (here we had a fixed hypothesis). See also Why does a 95% Confidence Interval (CI) not imply a 95% chance of containing the mean?

  • This is helpful, thanks. A bayesian approach with weak assumptions should work fine. I'm curious about your comment "The situation here is a bit different from the t-test where the variance is a nuisance parameter. In your case it is part of the parameters to be estimated." – I don't actually care about the variance in practice; I just want to find the one distribution with a higher mean. wouldn't that mean that the variance is a nuisance parameter as in a t-test? (obviously, running a t-test isn't really statistically valid. Just trying to grok what you mean by "nuisance.") – vroomfondel Feb 06 '24 at 18:15
  • @vroomfondel so you want to determine which population has a different mean, and not which population has a different mean and/or variance? – Sextus Empiricus Feb 06 '24 at 20:37
0

From a frequentist perspective, your problem can be framed as several binary regression models that predict the probability of a special label from a continuous measurement, under different assumptions about which label is special. We choose the model that best fits a manipulated response, corresponding to one of the assumptions about which label is special. If your labels are controlled, so that each experiment consists of one run of each label and all labels are tested each time, then you can use multinomial regression. Your analogy of throwing dice suggests that binary regression fits your need best. Although your measurements follow normal distributions, you can use a logit, probit, or any other link function, because the model specification is based on error assumptions rather than on predictor distributions.

Unlike with a typical data set for binary regression, you have to construct the response variable as many times as there are labels. If we assume Label A is special, then the response variable takes 1 if the label is A and 0 otherwise, as shown below.

label      measurement   response
    a              1.4          1
    b              1.2          0
    a              1.2          1
    c              1.3          0
... 

If we assume that Label B is special, then the response variable takes 1 if the Label is B and 0 otherwise. If we assume that Label C is special, then the response variable takes 1 if the Label is C and 0 otherwise. Then we fit three different binary regression models. In R, the models are fitted as

ModelA <- glm(I(label == "a") ~ measurement, data = ..., family = binomial(link = "logit"))
ModelB <- glm(I(label == "b") ~ measurement, data = ..., family = binomial(link = "logit"))
ModelC <- glm(I(label == "c") ~ measurement, data = ..., family = binomial(link = "logit"))

Thus we have three candidate models. The first model assumes Label A is special, whereas the last assumes Label C is special. Each model gives a maximized log-likelihood, which should be highest when the assumption about the special label is correct, and lower (and roughly equal) for the other models. Suppose the AIC values based on these log-likelihoods are 100, 102, and 110. Then the first model is exp((100 − 100)/2) = 1.00 times as probable as itself to minimize the information loss; the second model is exp((100 − 102)/2) = 0.368 times, and the third model exp((100 − 110)/2) = 0.007 times, as probable as the first model to minimize the information loss. Therefore, the probability that ModelA is the true model is

exp((AIC(ModelA) - AIC(ModelA))/2) / (
  exp((AIC(ModelA) - AIC(ModelA))/2) + 
  exp((AIC(ModelA) - AIC(ModelB))/2) +
  exp((AIC(ModelA) - AIC(ModelC))/2))

which is 1/(1 + 0.368 + 0.007) = 0.7274752. We can similarly derive the probabilities of B and C being special as 0.368/(1 + 0.368 + 0.007) = 0.2676232 and 0.007/(1 + 0.368 + 0.007) = 0.004901689. The three probabilities sum to one. See the logic under "How to use AIC in practice" at https://en.wikipedia.org/wiki/Akaike_information_criterion.

On the other hand, you can also model the process as linear models that predict the population mean of the measurements from the labels. In R, we fit three models

ModelA <- glm(measurement ~ I(label == "a"), data = ..., family = gaussian(link = "identity"))
ModelB <- glm(measurement ~ I(label == "b"), data = ..., family = gaussian(link = "identity"))
ModelC <- glm(measurement ~ I(label == "c"), data = ..., family = gaussian(link = "identity"))

Based on the log-likelihoods and AIC, we can derive the three probabilities as shown above for the binary logistic models. To relax the equal-error-variance assumption, which corresponds to equal variances of measurements among all labels, we can use generalized least squares, which allows the residual standard deviation to differ by a ratio between the assumed special label and the other labels.

ModelA <- nlme::gls(measurement ~ I(label == "a"), weights = nlme::varIdent(form = ~ 1 | I(label == "a")), data = ...)
ModelB <- nlme::gls(measurement ~ I(label == "b"), weights = nlme::varIdent(form = ~ 1 | I(label == "b")), data = ...)
ModelC <- nlme::gls(measurement ~ I(label == "c"), weights = nlme::varIdent(form = ~ 1 | I(label == "c")), data = ...)
DrJerryTAO
  • The tricky part of the question is that the variance might be different and the problem becomes a variant of the Behrens Fisher problem. These glm models assume equal variance and solve a different problem. – Sextus Empiricus Feb 07 '24 at 06:09
  • What about the generalized least squares GLS models I listed at the end of the answer? Does their relative likelihood answer the question? – DrJerryTAO Feb 07 '24 at 07:34
  • the GLS models allow the use of different variance, which allows to compute a likelihood, but is it the right likelihood? The variances used are based on fitted variances and may not need to be the true variances. – Sextus Empiricus Feb 07 '24 at 07:41
  • Another thing, the glm and gls functions do not allow fitting with boundary conditions. The coefficient describing the effect needs to be non-zero. – Sextus Empiricus Feb 07 '24 at 07:44