I am trying to jointly estimate the components of a mixture distribution.
I have a sample from a mixture, XY, composed of X and Y with a known mixing parameter m. I also have a separate sample of just Y. I am trying to estimate the PDF of X.
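To spell out the relationship (m is the weight of the Y component, as in the code below):

PDF(XY)(t) = (1 - m) * PDF(X)(t) + m * PDF(Y)(t)

so, rearranging,

PDF(X)(t) = (PDF(XY)(t) - m * PDF(Y)(t)) / (1 - m)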
Here is a concrete example.
# Generate XY sampling data
m <- 0.2 # Mixing parameter: the weight of the Y component
n <- 1000
k <- rbinom(1, n, prob = m)                    # number of draws that come from Y
xy <- c(rnorm(n - k, 1, 1), rnorm(k, 5, 0.5))  # X ~ N(1, 1), Y ~ N(5, 0.5)
# Estimate XY_pdf
XY_pdf <- density(xy)
plot(XY_pdf)
# Generate independent Y sampling data
y2 <- rnorm(500, mean = 5, sd = 0.5)
Y_pdf <- density(y2, bw = XY_pdf$bw)  # use the same bandwidth as XY_pdf
lines(Y_pdf$x, Y_pdf$y*m, col = "red", lty=2)
# Function for evaluating a Gaussian KDE at arbitrary points t; https://stackoverflow.com/a/34682302/2723734
kde_val <- function(x, t, bw) {
  # Evaluate the Gaussian KDE of the sample x (bandwidth bw) at each point in t
  sapply(t, function(ti) {
    kernelValues <- rep(0, length(x))
    for (i in 1:length(x)) {
      transformed <- (ti - x[i]) / bw
      kernelValues[i] <- dnorm(transformed, mean = 0, sd = 1) / bw
    }
    return(sum(kernelValues) / length(x))
  })
}
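For what it's worth, the same evaluation can be written without the explicit loops; this is just a vectorized sketch equivalent to kde_val above, not a different estimator:

kde_val_vec <- function(x, t, bw) {
  # For each evaluation point t_j, average dnorm((t_j - x_i) / bw) / bw over the sample x
  rowMeans(dnorm(outer(t, x, "-") / bw) / bw)
}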
t <- seq(-3, 9, by = 0.01)
xy_val <- kde_val(xy, t, XY_pdf$bw)
y_val <- kde_val(y2, t, Y_pdf$bw) * m
x_val_est <- (xy_val - y_val) / (1 - m)  # divide by (1 - m) so the X estimate integrates to ~1
lines(t, x_val_est, col = "blue")
The plot shows PDF(XY) and m*PDF(Y) estimated by KDE (black and red), and the resulting estimate of PDF(X) = (PDF(XY) - m*PDF(Y)) / (1 - m) (blue).
The estimate of PDF(X) is pretty good, except towards the tail where it becomes negative due to sampling variation in XY and Y.
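A quick numerical check of the raw estimate, using the grid defined above (spacing 0.01):

range(x_val_est)       # the minimum is negative, from the dips in the tail
sum(x_val_est) * 0.01  # approximate integral over t; roughly 1, up to sampling variation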
How do I properly estimate PDF(X)?
(I can't assume the distributions are Gaussian.)




Comments:

m is known from previous experiments in identical conditions. – thc Mar 12 '21 at 19:39

n <- 1000; k <- rbinom(1, n, prob = m); xy <- c(rnorm(k, 1, 1), rnorm(n-k, 5, 0.5)). Even that's not quite correct, because the values don't occur randomly. When it matters, this works: i <- sample.int(2, 1000, prob=c(m, 1-m), replace=TRUE); xy <- rnorm(length(i), ifelse(i==1, 1, 5), ifelse(i==1, 1, 0.5)). You can also just execute a final sample(xy) to randomize the data. – whuber Mar 12 '21 at 19:52