I am trying to understand why heavy censoring (i.e., many patients being censored) is undesirable in survival analysis.
As a proof of concept, suppose there are 5 patients, all of whom enter the study at the same time:
- patient1 has event at t1
- patient2 has event at t2
- patient3 drops out of the study at t3
- patient4 has event at t4
- patient5 has not had the event when the study ends at t5
- t5 > t4 > t3 > t2 > t1
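(Semi-Parametric Approach) Here is my attempt to write the model and partial likelihood for a Cox-PH regression in this situation, where $\delta_i = 1$ if patient $i$ has the event and $\delta_i = 0$ if patient $i$ is censored: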
$$ h(t|X) = h_0(t) \exp(\beta^T X) $$
$$ L(\beta) = \prod_{i: \delta_i = 1} \frac{\exp(\beta^T X_i)}{\sum_{j: t_j \geq t_i} \exp(\beta^T X_j)} $$
$$ L(\beta) = \frac{\exp(\beta^T X_1)}{\sum_{j: t_j \geq t_1} \exp(\beta^T X_j)} \times \frac{\exp(\beta^T X_2)}{\sum_{j: t_j \geq t_2} \exp(\beta^T X_j)} \times \frac{\exp(\beta^T X_4)}{\sum_{j: t_j \geq t_4} \exp(\beta^T X_j)} $$
$$ L(\beta) = \frac{\exp(\beta^T X_1)}{\exp(\beta^T X_1) + \exp(\beta^T X_2) + \exp(\beta^T X_3) + \exp(\beta^T X_4) + \exp(\beta^T X_5)} \times \frac{\exp(\beta^T X_2)}{\exp(\beta^T X_2) + \exp(\beta^T X_3) + \exp(\beta^T X_4) + \exp(\beta^T X_5)} \times \frac{\exp(\beta^T X_4)}{\exp(\beta^T X_4) + \exp(\beta^T X_5)} $$
The baseline hazard $h_0(t_i)$ cancels between the numerator and denominator of each factor, so only the $\exp(\beta^T X)$ terms remain. Patient 3 drops out at $t_3 < t_4$, so the risk set at $t_4$ contains only patients 4 and 5.
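As a sanity check, the partial likelihood for this 5-patient example can be evaluated numerically. The following is only a minimal sketch under stated assumptions: a single scalar covariate, no tied event times, and made-up values for the times and covariates (none of these numbers come from a real study). The function name `cox_partial_loglik` is hypothetical. The risk set at each event time is built directly from the observed times, so the patient who drops out at t3 is excluded from the risk set at t4.

```python
import math

def cox_partial_loglik(beta, times, events, x):
    """Log partial likelihood for a Cox-PH model with one covariate.

    Assumes no tied event times. events[i] is 1 for an event, 0 for censoring.
    """
    eta = [beta * x_i for x_i in x]  # linear predictors beta * x_i
    loglik = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if d_i != 1:
            continue  # censored patients contribute no factor of their own
        # Risk set at t_i: everyone still under observation at that time
        risk_sum = sum(math.exp(eta[j]) for j, t_j in enumerate(times) if t_j >= t_i)
        loglik += eta[i] - math.log(risk_sum)
    return loglik

# Made-up data for illustration: t1 < t2 < t3 < t4 < t5
times = [1.0, 2.0, 3.0, 4.0, 5.0]
events = [1, 1, 0, 1, 0]           # patients 3 and 5 are censored
x = [0.0, 1.0, 0.0, 1.0, 0.0]      # hypothetical binary covariate

# At beta = 0 every factor is 1 / (size of risk set), with risk sets of size 5, 4, 2
print(cox_partial_loglik(0.0, times, events, x))
print(-math.log(5 * 4 * 2))
```

Only the three event times produce factors; the censored patients enter solely through the denominators of the risk sets they belong to.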
(Parametric Approach) Here is my attempt to write the model and likelihood for an AFT model in this situation. Note that the likelihood is written in terms of the distribution of $\epsilon$ rather than the survival times $T$. I have heard that if $T$ is chosen to follow an Exponential or Weibull distribution, then $\epsilon$ follows an extreme value distribution (the Gumbel distribution):
$$ \log(T) = \mu + \beta^T X + \sigma \epsilon $$
$$ L(\mu, \sigma, \beta) = \prod_{i=1}^{n} \left[ \frac{1}{\sigma} f\left( \frac{\log(t_i) - \mu - \beta^T X_i}{\sigma} \right) \right]^{\delta_i} \left[ 1 - F\left( \frac{\log(t_i) - \mu - \beta^T X_i}{\sigma} \right) \right]^{1-\delta_i} $$
$$ L(\mu, \sigma, \beta) = \left[ \frac{1}{\sigma} f\left( \frac{\log(t_1) - \mu - \beta^T X_1}{\sigma} \right) \right] \times \left[ \frac{1}{\sigma} f\left( \frac{\log(t_2) - \mu - \beta^T X_2}{\sigma} \right) \right] \times \left[ 1 - F\left( \frac{\log(t_3) - \mu - \beta^T X_3}{\sigma} \right) \right] \times \left[ \frac{1}{\sigma} f\left( \frac{\log(t_4) - \mu - \beta^T X_4}{\sigma} \right) \right] \times \left[ 1 - F\left( \frac{\log(t_5) - \mu - \beta^T X_5}{\sigma} \right) \right] $$
Here $f$ and $F$ are the density and CDF of $\epsilon$, and the factor $1/\sigma$ on the uncensored terms comes from the change of variables from $\epsilon$ to $\log T$.
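To make the censored and uncensored contributions concrete, here is a minimal sketch of the AFT log-likelihood for the same 5 patients. It rests on several assumptions that are not part of the setup above: $\epsilon$ is taken to follow the standard minimum extreme value (Gumbel) distribution, with density $f(z) = \exp(z - e^z)$ and survival function $1 - F(z) = \exp(-e^z)$, which corresponds to a Weibull $T$; there is a single scalar covariate; the data values are made up; and the function name `weibull_aft_loglik` is hypothetical. Uncensored terms include the $1/\sigma$ Jacobian factor for the density of $\log T$.

```python
import math

def weibull_aft_loglik(mu, sigma, beta, times, events, x):
    """Log-likelihood of log(T) = mu + beta * x + sigma * eps with right censoring.

    Assumes eps has the standard minimum extreme value (Gumbel) distribution.
    events[i] is 1 for an event, 0 for censoring.
    """
    loglik = 0.0
    for t_i, d_i, x_i in zip(times, events, x):
        z = (math.log(t_i) - mu - beta * x_i) / sigma  # standardized residual
        if d_i == 1:
            # Event: log of (1/sigma) * f(z), with f(z) = exp(z - exp(z))
            loglik += z - math.exp(z) - math.log(sigma)
        else:
            # Censored: log of 1 - F(z) = exp(-exp(z))
            loglik += -math.exp(z)
    return loglik

# Made-up data for illustration: t1 < t2 < t3 < t4 < t5
times = [1.0, 2.0, 3.0, 4.0, 5.0]
events = [1, 1, 0, 1, 0]           # patients 3 and 5 are censored
x = [0.0, 1.0, 0.0, 1.0, 0.0]      # hypothetical binary covariate

print(weibull_aft_loglik(0.0, 1.0, 0.0, times, events, x))
```

Each censored patient contributes only a survival term: the data say the event time exceeds the censoring time, but not where beyond it the event falls.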
So, for both the Cox-PH and AFT models, how are inference and parameter estimation negatively affected when a large proportion of the patients is censored? For example, does heavy censoring lead to higher variance, bias, or inconsistency, or does it require larger sample sizes to achieve results comparable to those from smaller samples with less censoring? And does the numerical optimization itself become difficult (e.g., rank-deficient matrices, undefined matrix inverses, or a non-identifiable model)?
