This question involves both math and coding in R. Apologies if this should be on Stack Overflow, but I decided statisticians would ultimately provide better support.
I have a simple likelihood that infers the parameters $R_0$ and $k$ from a vector of integers, each giving the total size of a transmission cluster (denoted $y$). The original likelihood was simply:
(1) $L(R_0,k\mid y)=\prod_{i=1}^{\infty}P(Y=i)^{n_i}$, where $n_i$ is the number of observed clusters of size $i$.
Some of the integers now represent censored values (i.e. observation stopped before the cluster ended), so the true size of those clusters is at least the recorded $y$. I would like to include this in the likelihood, so that the likelihood of $R_0$ and $k$, when $y_a$ clusters of size $y$ have been fully observed and $y_b$ clusters have been censored at size $y$, is:
(2) $L(R_0,k\mid y_a,y_b)=\prod_{y}P(Y=y)^{y_a}\,P(Y\geq y)^{y_b}$
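Equivalently, on the log scale (which is how I compute things in code), the target is:

$$\ell(R_0,k)=\sum_{y}\Big[\,y_a\log P(Y=y)+y_b\log P(Y\geq y)\,\Big]$$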
I would like help extending this likelihood function in R and ensuring it is mathematically and computationally correct.
Originally, my data consisted of a single vector of cluster sizes, which I ran through a custom function (the original code is at the end, including code to generate sample data).
I am unsure how to extend this to include $P(Y\geq y)$. My data are now in a two-column matrix, with the first column the integer values and the second column indicating censoring status (1 = censored, 0 = not censored).
I have tried modifying the function to split the $Y$ column into "full" values and "censored" values. I assume $P(Y\geq y)=1-\sum^{y-1}_{j=1}P(Y=j)$. One attempt was to create a separate `calc_sizes` for full clusters and for censored clusters (denoted b), where the censored clusters subtract one from the cluster size (i.e. `calc_sizes.b <- unique(c(1, b-1))`) to account for the $\sum^{y-1}_{j=1}$ upper limit, but I am unsure whether this is correct, or where and how the $1-F(y-1)$ step should enter.
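To make that assumption concrete, here is a minimal sketch of how the survival term could be computed on the log scale. `log_pmf` and `log_surv` are illustrative names I am introducing; `log_pmf` is the log of the cluster-size pmf given at the end of the post:

```r
## Sketch only: log P(Y >= y) = log(1 - sum_{j=1}^{y-1} P(Y = j)).
log_pmf <- function(x, k, r0) {
  lgamma(k*x + (x - 1)) - (lgamma(k*x) + lgamma(x + 1)) +
    (x - 1)*log(r0/k) - (k*x + (x - 1))*log(1 + r0/k)
}
log_surv <- function(y, k, r0) {
  if (y <= 1) return(0)  # clusters have size >= 1, so P(Y >= 1) = 1
  log1p(-sum(exp(log_pmf(1:(y - 1), k, r0))))  # log(1 - CDF(y - 1))
}
```

Working on the log scale with `log1p()` avoids losing precision when the CDF is close to 1; for very large censored sizes the `exp()`/`sum()` step could be replaced by a log-sum-exp if needed.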
This initially seemed like a simple task, but I am finding it very challenging. I have never attempted custom likelihoods before, so any advice would be much appreciated.
##### generate sample data
##### data generating function
bp <- function(gens = 20, init.size = 1, offspring, ...) {
  Z <- list()          # initiate the list
  Z[[1]] <- init.size  # set the first position of the list as the
                       # number of index cases
  i <- 1
  while (sum(Z[[i]]) > 0 && i <= gens) {
    Z[[i+1]] <- offspring(sum(Z[[i]]), ...)
    i <- i + 1
  }
  return(Z)
}
##### generate vector of 10000 with true R0=0.9 and k=0.25
set.seed(123)
Y <- unlist(lapply(replicate(n=10000, bp(offspring=rnbinom, mu=0.9,
size=0.25)), function(x) sum(unlist(x))))
##### Original function to calculate likelihoods (equation 1) ####
clust.likelihood <- function(Y, k, r0) {
  # log pmf of the cluster-size distribution (formula given at the end)
  nb.likelihood <- function(x) {
    lgamma(k*x + (x - 1)) - (lgamma(k*x) + lgamma(x + 1)) +
      (x - 1)*log(r0/k) - (k*x + (x - 1))*log(1 + r0/k)
  }
  calc_sizes <- unique(c(1, Y))
  likelihoods <- c()
  likelihoods[calc_sizes] <- sapply(calc_sizes, nb.likelihood)
  sexpl <- sum(exp(likelihoods), na.rm = TRUE)
  if (sexpl < 1) {
    maxl <- log(1 - sexpl)  # probability mass not at the observed sizes
  } else {
    maxl <- -Inf
  }
  likelihoods <- c(likelihoods, maxl)
  cluster_likelihoods <- likelihoods[Y]
  result <- sum(cluster_likelihoods)
  return(result)
}
##### function to estimate parameters ####
param.est <- function(y) {
  ret <- c()
  # moment estimate of R0: mean final cluster size = 1/(1 - R0)
  r0_hat <- 1 - (1/mean(y))
  if (r0_hat < 1e-9) r0_hat <- 1e-9
  # fix R0 at its moment estimate and maximise the likelihood over k only
  t <- optim(par = 0.01, lower = 1e-9,
             fn = function(x) -clust.likelihood(y, x[1], r0_hat),
             method = "L-BFGS-B")
  k_hat <- t$par[[1]]
  ret$k_hat <- k_hat; ret$r0_hat <- r0_hat
  ret <- unlist(ret)
  return(ret)
}
##### estimate parameters
param.est(Y)
##### code to generate new data
##### randomly "censor" 10% of values
cens.status <- ifelse(runif(10000)>0.9, 1, 0)
Y.new <- cbind(Y,cens.status)
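For concreteness, here is one sketch of how the extended likelihood for equation (2) might look, assuming the two-column matrix described above (column 1 = cluster size, column 2 = 1 if censored). The function name and internals are illustrative and untested against real data:

```r
## Sketch of equation (2): fully observed clusters contribute log P(Y = y),
## censored clusters contribute log P(Y >= y) = log(1 - CDF(y - 1)).
clust.likelihood.cens <- function(Ymat, k, r0) {
  nb.loglik <- function(x) {  # log pmf, same expression as in clust.likelihood
    lgamma(k*x + (x - 1)) - (lgamma(k*x) + lgamma(x + 1)) +
      (x - 1)*log(r0/k) - (k*x + (x - 1))*log(1 + r0/k)
  }
  log.surv <- function(y) {
    if (y <= 1) return(0)  # P(Y >= 1) = 1
    log1p(-sum(exp(nb.loglik(1:(y - 1)))))
  }
  y.full <- Ymat[Ymat[, 2] == 0, 1]  # fully observed cluster sizes
  y.cens <- Ymat[Ymat[, 2] == 1, 1]  # censored cluster sizes
  sum(nb.loglik(y.full)) + sum(vapply(y.cens, log.surv, numeric(1)))
}
```

Something like this could then be substituted for `clust.likelihood` inside `param.est()`, passing `Y.new` through, though note the moment estimate of $R_0$ from `mean(y)` is itself biased once some sizes are censored.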
For reference, $P(Y=y)=\frac{\Gamma(ky+y-1)}{\Gamma(ky)\,\Gamma(y+1)}\,\frac{\left(\frac{R_0}{k}\right)^{y-1}}{\left(1+\frac{R_0}{k}\right)^{ky+y-1}}$
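As an illustrative sanity check on this formula: for a cluster started by a single case, $P(Y=1)$ is just the probability that the index case has zero offspring, so the log pmf at $y=1$ should match `dnbinom(0, size = k, mu = R0, log = TRUE)`; and for subcritical $R_0$ extinction is certain, so the pmf should sum to 1:

```r
## Illustrative check of the cluster-size pmf (log scale), using the same
## expression as nb.likelihood in the code above.
log_pmf <- function(x, k, r0) {
  lgamma(k*x + (x - 1)) - (lgamma(k*x) + lgamma(x + 1)) +
    (x - 1)*log(r0/k) - (k*x + (x - 1))*log(1 + r0/k)
}
# P(Y = 1) = P(0 secondary cases from the index case)
all.equal(log_pmf(1, k = 0.25, r0 = 0.9),
          dnbinom(0, size = 0.25, mu = 0.9, log = TRUE))
# subcritical (R0 < 1): the pmf should sum to (numerically) 1
sum(exp(log_pmf(1:1e5, k = 0.25, r0 = 0.9)))
```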