In the code below I test several parametric distributions against the lung dataset from the survival package data for best fit.
My main question with respect to the below code is my use of fitdist(lung$time, dist). Is this the correct usage of the fitdist() function from the firstdistrplus package? Where the only data input into the model is the "time" column?
During my research as I drafted this code I came across examples where only the "time" values are input into the fitdist() function and there are further notes that "...the fitdistrplus::fitdist() function can handle censored data without explicitly specifying the censoring status. The function assumes that any value greater than the largest observed event time is right-censored..." which doesn't make sense to me. The lung dataset has a "status" column where 1=censored, 2=dead. Why wouldn't this extra information be used?
I reviewed the package documentation for the fitdist() function and the example dataset used is groundbeef$serving which doesn't have the time and status elements of the lung dataset.
Alterenatively, if fitdist() ignores censoring status, should I be instead using another parametric distribution fitting function from another package?
Code:
library(fitdistrplus)
library(survival)
Create vector of parametric distributions to test
distList <- c("weibull", "exp", "gamma", "lnorm")
Function fits each distribution and extracts AIC, BIC, log-likelihood values
fit_dist <- function(dist) {
tmp <- fitdist(lung$time, dist)
c(tmp$aic, tmp$bic, tmp$loglik)
}
Apply above function to each distribution in the distList
results_list <- lapply(distList, fit_dist)
Convert the above list of results to a data frame
results_df <- data.frame(t(matrix(unlist(results_list), nrow = 3)))
colnames(results_df) <- c("aic", "bic", "logLik")
rownames(results_df) <- distList
Find the distribution with the lowest AIC, BIC, and logLik values
bestFitAIC <- rownames(results_df)[which.min(results_df$aic)]
bestFitBIC <- rownames(results_df)[which.min(results_df$bic)]
bestFitLogLik <- rownames(results_df)[which.max(results_df$logLik)]
Print the results data frame and best fitting distributions
results_df
cat("\nBest fitting distribution using AIC:", bestFitAIC, "\n")
cat("\nBest fitting distribution using BIC:", bestFitBIC, "\n")
cat("\nBest fitting distribution using Log-Likelihood:", bestFitLogLik, "\n")