
I have two data sets, $\{x_i\}$ and $\{y_i\}$. I know that $\{x_i\}$ was sampled from some distribution $X$, and that $\{y_i\}$ was sampled from a mixture of $X$ and some other, unknown distribution $Y$. I want to estimate the mixing ratio, i.e. what fraction of the samples in $\{y_i\}$ come from $X$.

If I make some assumptions about $Y$ (such as it being normal), this is just a simple mixture model problem, but ideally I don't want to do that. I'm wondering whether there is some approach to this problem without such assumptions, or whether it simply isn't possible.

One idea I had was to model $Y$ with a bunch of kernels (evenly spaced normal distributions with known $\sigma$) and use MLE to find the mixing ratios of $X$ and the kernels. I assume, though, that doing so would just set the mixing ratio for $X$ to zero and give me the KDE. Perhaps there is some way of penalising this, but my only thought was to put a prior on the mixing ratio of $X$, which I would rather avoid.

If it is possible to solve this problem for categorical mixture models, then I can just bin my data. However, I couldn't find a way of solving it in the categorical case either, or really anything on parameter estimation for categorical mixture models (which makes sense, because the empirical distribution of the sample already has the maximum likelihood).

DBruwel
  • 21
  • 2
    If you only have nonparametric density estimates for the two datasets (and stick to Frequentist methods), then I think there's an identifiability issue. If you know the mixing proportion $\alpha$, then one could estimate the density of $Y$ with $\hat{f}_Y(y)=(\hat{f}_{XY}(y)-\alpha \hat{f}_X(y))/(1-\alpha)$. But I think you can't get there from here if you don't know both the mixing proportion and the distribution of $Y$. Maybe your Bayesian suggestion might have promise. – JimB Feb 07 '24 at 18:27
  • 1
    After reading my comment again, I don't think my next-to-last sentence was very clear. What I meant was that either the mixing proportion or the distribution of $Y$ would need to be known to estimate the other. If both are unknown, that's where the identifiability issue comes into play. – JimB Feb 08 '24 at 05:06
  • @JimB Thanks, yeah I thought this would be the case. – DBruwel Feb 11 '24 at 22:49
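
The identifiability issue raised in the comments can be made concrete in the binned (categorical) setting. The following is a minimal sketch, not a method proposed in the thread: the function names `max_mixing_proportion` and `implied_y` and all the numbers are hypothetical, and exact bin probabilities are assumed rather than noisy estimates from samples. It shows that every $\alpha$ from 0 up to $\min_k p_{XY}(k)/p_X(k)$ yields a valid distribution for $Y$, so without further assumptions the data only give an upper bound on the mixing proportion.

```python
import numpy as np

def max_mixing_proportion(p_x, p_mix):
    """Largest alpha for which p_mix - alpha * p_x stays non-negative in every bin."""
    p_x = np.asarray(p_x, dtype=float)
    p_mix = np.asarray(p_mix, dtype=float)
    support = p_x > 0  # only bins where X has mass constrain alpha
    return float(min(1.0, np.min(p_mix[support] / p_x[support])))

def implied_y(p_x, p_mix, alpha):
    """Distribution of Y implied by a given alpha < 1 (may be invalid if alpha is too large)."""
    p_x = np.asarray(p_x, dtype=float)
    p_mix = np.asarray(p_mix, dtype=float)
    return (p_mix - alpha * p_x) / (1.0 - alpha)

# Hypothetical bin probabilities, chosen only for illustration.
p_x = np.array([0.5, 0.3, 0.2, 0.0])
p_y_true = np.array([0.0, 0.2, 0.3, 0.5])
alpha_true = 0.4
p_mix = alpha_true * p_x + (1 - alpha_true) * p_y_true

alpha_max = max_mixing_proportion(p_x, p_mix)
for alpha in (0.0, 0.2, alpha_max):
    # Each of these alphas explains p_mix equally well with a different, valid Y.
    print(alpha, implied_y(p_x, p_mix, alpha))
```

In this example the upper bound coincides with the true proportion only because the assumed $Y$ puts no mass in one of the bins where $X$ has mass. If $Y$ overlaps $X$ everywhere, the bound is strictly larger than the true value, and with estimated bin frequencies the minimum ratio is also sensitive to sampling noise in sparsely populated bins.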
