How to correctly select parametric distribution that best fits survival data?

Question

In the code below I test several parametric distributions against the lung dataset from the survival package data for best fit.

My main question with respect to the below code is my use of fitdist(lung$time, dist). Is this the correct usage of the fitdist() function from the firstdistrplus package? Where the only data input into the model is the "time" column?

During my research as I drafted this code I came across examples where only the "time" values are input into the fitdist() function and there are further notes that "...the fitdistrplus::fitdist() function can handle censored data without explicitly specifying the censoring status. The function assumes that any value greater than the largest observed event time is right-censored..." which doesn't make sense to me. The lung dataset has a "status" column where 1=censored, 2=dead. Why wouldn't this extra information be used?

I reviewed the package documentation for the fitdist() function and the example dataset used is groundbeef$serving which doesn't have the time and status elements of the lung dataset.

Alterenatively, if fitdist() ignores censoring status, should I be instead using another parametric distribution fitting function from another package?

Code:

library(fitdistrplus)
library(survival)
Create vector of parametric distributions to test
distList <- c("weibull", "exp", "gamma", "lnorm")
Function fits each distribution and extracts AIC, BIC, log-likelihood values
fit_dist <- function(dist) {
  tmp <- fitdist(lung$time, dist)
  c(tmp$aic, tmp$bic, tmp$loglik)
}
Apply above function to each distribution in the distList
results_list <- lapply(distList, fit_dist)
Convert the above list of results to a data frame
results_df <- data.frame(t(matrix(unlist(results_list), nrow = 3)))
colnames(results_df) <- c("aic", "bic", "logLik")
rownames(results_df) <- distList
Find the distribution with the lowest AIC, BIC, and logLik values
bestFitAIC <- rownames(results_df)[which.min(results_df$aic)]
bestFitBIC <- rownames(results_df)[which.min(results_df$bic)]
bestFitLogLik <- rownames(results_df)[which.max(results_df$logLik)]
Print the results data frame and best fitting distributions
results_df
cat("\nBest fitting distribution using AIC:", bestFitAIC, "\n")
cat("\nBest fitting distribution using BIC:", bestFitBIC, "\n")
cat("\nBest fitting distribution using Log-Likelihood:", bestFitLogLik, "\n")

score 2 · Accepted Answer · answered May 06 '23 at 13:06

Fitting a distribution to censored data is somewhat different from fitting to uncensored data. The likelihood calculation with a censored observation involves the survival function of the distribution, not the probability of the observed value. See this page for an introduction to the likelihood contributions. In the package to which you refer, uncensored data are handled by the fitdist() function while censored data require the fitdistcens() function.

Although those functions might help with simple data sets, as I understand it they won't help much with finding the best underlying distribution for a regression model involving covariates and censored outcomes. For a wide class of such models you can evaluate whether the distribution of (censored) residuals matches the associated error distribution. These course notes show error distributions associated with several types of parametric survival models. This page provides a brief introduction to calculating residuals in parametric models. Chapters 18 and 19 of Harrell's Regression Modeling Strategies discuss parametric modeling of survival data in detail, with an extensively worked-through example.

score 0 · Answer 2 · answered May 07 '23 at 07:19

For now I will be using the flexsurvreg() function of the flexsurv package, for fitting parametric distributions to right-censored data such as the lung dataset from the survival package. Below is the current state of code for goodness-of-fit tests for several parametric distributions.

I highly recommend studying Chapters 18 and 19 of Harrell's Regression Modeling Strategies recommended by EdM as well as https://grodri.github.io/survival/ParametricSurvival.pdf!

Code:

library(flexsurv)
library(survival)
data(lung)
Create vector of parametric distributions to test
distList <- c("weibull", "exp", "gamma", "lnorm","llogis","gompertz")
Function fits each distribution and extracts AIC, BIC, log-likelihood values
fit_dist <- function(dist) {
  tmp <- flexsurvreg(Surv(time, status) ~ 1, data = lung, dist = dist)
  c(AIC(tmp), BIC(tmp), as.numeric(logLik(tmp)))
}
Apply above function to each distribution in the distList
results_list <- lapply(distList, fit_dist)
Convert the above list of results to a data frame
results_df <- data.frame(t(matrix(unlist(results_list), nrow = 3)))
colnames(results_df) <- c("aic", "bic", "logLik")
rownames(results_df) <- distList
Find the distribution with the lowest AIC, BIC, and logLik values
bestFitAIC <- rownames(results_df)[which.min(results_df$aic)]
bestFitBIC <- rownames(results_df)[which.min(results_df$bic)]
bestFitLogLik <- rownames(results_df)[which.max(results_df$logLik)]
Print the results data frame and best fitting distributions
results_df
cat("\nBest fitting distribution using AIC:", bestFitAIC, "\n")
cat("\nBest fitting distribution using BIC:", bestFitBIC, "\n")
cat("\nBest fitting distribution using Log-Likelihood:", bestFitLogLik, "\n")

How to correctly select parametric distribution that best fits survival data?

Create vector of parametric distributions to test

Function fits each distribution and extracts AIC, BIC, log-likelihood values

Apply above function to each distribution in the distList

Convert the above list of results to a data frame

Find the distribution with the lowest AIC, BIC, and logLik values

Print the results data frame and best fitting distributions

2 Answers2