8

I've begun working with the Gumbel distribution and started with the example in the package documentation at https://cran.r-project.org/web/packages/fitdistrplus/fitdistrplus.pdf. For the Gumbel distribution fitdist() requires start parameters (template is fitdist(data, distr, method = c(...), start=NULL,...)), where it's described as "A named list giving the initial values of parameters of the named distribution or a function of data computing initial values and returning a named list. This argument may be omitted (default) for some distributions for which reasonable starting values are computed..." Gumbel is not on the list of distributions where start parameters may be omitted. In the below code I use the start parameters from the reference manual example and it works, but that's random. If I use start parameters of start = list(1,1) for example the function doesn't work.

Is there a way to generate these start parameters automatically? When I find start parameters that work the output is indifferent for example whether I use (5,5) or (500,500), so a hack solution would be to randomly generate parameter values until the function works. But I'm hoping for a cleaner, non-hack solution.

By the way I realize Gumbel is not a good fit for lung dataset! I'm just using lung for sake of example.

Code:

library(evd)
library(fitdistrplus)
library(survival)

Gumbel distribution

time <- seq(0, 1000, by = 1) deathTime <- lung$time[lung$status == 1] fitGum <- fitdist(deathTime, "gumbel",start=list(a=10,b=10)) survGum <- 1-evd::pgumbel(time, fitGum$estimate[1], fitGum$estimate[2])

plot(time,survGum,type="n",xlab="Time",ylab="Survival Probability", main="Lung Survival") lines(survGum, type = "l", col = "red", lwd = 3) # plot Gumbel

  • 2
    A general answer to this question is provided in my post at https://stats.stackexchange.com/questions/160552. For distribution fitting specifically, people often use simple but potentially inferior methods to obtain initial estimates: consider method of moments or methods based on order statistics, for instance. There's still no guarantee, because (as you note) when the distribution is a poor description of the data, all bets are off. – whuber May 26 '23 at 13:51
  • 3
    Following @whuber, method of moments seems pretty simple for this distribution. Using the moment equations from wikipedia, we would have $s=\frac{\sqrt{6}}{\pi}\sigma$ and $l=\mu-\gamma s$ as our initial values, where $l,s$ are the location and scale parameters (as parameterized by wiki), $\mu$ and $\sigma$ are the sample mean and stdev, and $\gamma\approx 0.57$ is the Euler-Mascheroni constant. (I have not double checked this math). – John Madden May 26 '23 at 14:24
  • 3
    @JohnMadden that's for the maximum extreme value form. There's always ambiguity when someone just says "Gumbel distribution" or "extreme value distribution" without also specifying "maximum" or "minimum." I suspect that the "minimum" form is at issue here. – EdM May 26 '23 at 14:53

2 Answers2

12

The NIST page on Gumbel distributions shows the method of moments estimators for the parameters of both the maximum and minimum extreme-value distributions. Those are easily calculated and should provide reliable initial estimates.

The parameterization of the density function for the minimum extreme value distribution on that page is:

$$f(x) = \frac{1}{\beta} \exp({\frac{x-\mu}{\beta}})\exp({-\exp({\frac{x-\mu}{\beta}}}))$$

with location parameter $\mu$ and scale parameter $\beta$.

In that parameterization, with sample mean $\bar X$ and sample standard deviation $s$, the method of moments estimators are:

$$\tilde{\beta}=\frac{s\sqrt6}{\pi}$$

and:

$$\tilde{\mu}=\bar X +0.5772 \tilde{\beta} $$

where 0.5772 is an approximation to Euler's constant.

That's used by survreg() in R, which you can see by typing the following at the command prompt:

survival::survreg.distributions$extreme$init
EdM
  • 92,183
  • 10
  • 92
  • 267
6

Putting EdM's answer into code, which seems to work well and is very concise:

library(evd)
library(fitdistrplus)
library(survival)

time <- seq(0, 1000, by = 1) deathTime <- lung$time[lung$status == 2]

scale_est <- (sd(deathTime)sqrt(6))/pi loc_est <- mean(deathTime) + 0.5772157scale_est

fitGum <- fitdist(deathTime, "gumbel",start=list(a = loc_est, b = scale_est)) survGum <- 1-evd::pgumbel(time, fitGum$estimate[1], fitGum$estimate[2])

plot(time,survGum,type="n",xlab="Time",ylab="Survival Probability", main="Lung Survival") lines(survGum, type = "l", col = "red", lwd = 3) # plot Gumbel

Alternative: in referencing Why is nls() giving me "singular gradient matrix at initial parameter estimates" errors? per whuber's comment, which says to paraphrase: "Automatically finding good starting values for a nonlinear model is an art. (It's relatively easy for one-off datasets when you can just plot the data and make some good guesses visually.) One approach is to linearize the model and use least squares estimates." I came up with the following which appears to work in this case.

library(evd)
library(fitdistrplus)
library(survival)

time <- seq(0, 1000, by = 1) deathTime <- lung$time[lung$status == 2]

Define the linearized model

linearized_model <- function(time, a, b) { log_time <- log(time) log_time - a / b }

Define the objective function for least squares estimation

objective_function <- function(params) { a <- params[1] b <- params[2] predicted_values <- linearized_model(deathTime, a, b) residuals <- predicted_values - log(deathTime) sum(residuals^2) }

Least squares estimation to obtain starting parameters

starting_params <- optim(c(1, 1), objective_function)$par

fitGum <- fitdist(deathTime, "gumbel",start=list(a = starting_params[1], b = starting_params[2])) survGum <- 1-evd::pgumbel(time, fitGum$estimate[1], fitGum$estimate[2])

plot(time,survGum,type="n",xlab="Time",ylab="Survival Probability", main="Lung Survival") lines(survGum, type = "l", col = "red", lwd = 3) # plot Gumbel