Background

I have a collection of functions with trainable parameters that I am implementing as Keras model classes, which enables immediate use of a variety of objective functions, optimizers, and training-related methods (e.g. early stopping callback).

These functions take a single variable, output a single variable, and have no more than a dozen parameters. The number of explicitly-written operators ('+', '-', '*', '/', 'exp', 'log', 'arctan') is also around a dozen, although I caution that this measure of model complexity is unreliable (i.e. equivalent expressions can have a greater or smaller number of explicitly-written operators). The point is that these are not enormously complex models like those used in deep learning. Let the following exemplify this description.

Example

Verhulst growth model $$P(t) = \frac{K}{1+ \left( \frac{K-P_0}{P_0} \right) \exp \{-rt\}}$$ where $P(t)$ is a population size at time $t$, $K$ is the carrying capacity, $P_0$ is the initial population size, and $r$ is the "unimpeded" exponential rate constant.
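
For concreteness, here is a minimal sketch of the kind of Keras model class I mean; the class name and the default initial values are arbitrary illustrations, not my actual code.

```python
import tensorflow as tf

class VerhulstModel(tf.keras.Model):
    """Verhulst (logistic) growth curve with trainable K, P0, and r."""

    def __init__(self, K_init=1.0, P0_init=0.5, r_init=0.1, **kwargs):
        super().__init__(**kwargs)
        # Each parameter is a trainable scalar variable.
        self.K = tf.Variable(K_init, dtype=tf.float32, name="K")
        self.P0 = tf.Variable(P0_init, dtype=tf.float32, name="P0")
        self.r = tf.Variable(r_init, dtype=tf.float32, name="r")

    def call(self, t):
        # P(t) = K / (1 + ((K - P0) / P0) * exp(-r t))
        alpha = (self.K - self.P0) / self.P0
        return self.K / (1.0 + alpha * tf.exp(-self.r * t))
```

Compiling with `model.compile(optimizer="adam", loss="mse")` then gives immediate access to `fit`, callbacks such as `tf.keras.callbacks.EarlyStopping`, and so on.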

Problem Statement

I started off with random initialization of parameters by sampling from either a standard normal distribution or a uniform distribution over $[0,1]$. But I have encountered the following issues, which I have made only partial progress in addressing:

  • Non-convexity of the loss function (often mean-squared error) over the parameters of many of these models, combined with sampling the parameter space near the boundaries of convex regions, has produced parameter estimates that simply started in the wrong "valley".
  • If an initial parameter is quite far away from its optimal value, even within the same convex region, it can take an extremely long time to converge.

I have found it useful to study 3D surface plots and contour plots of the loss function over pairs of parameters, along with Hessian-based tests of convexity (a sketch of such a check follows). For sufficiently small datasets and simple models, it is possible to copy-paste the data into tools like the Desmos calculator and manually tune parameters, but this does not scale. I have room to grow on this subject, and a source that accelerates my learning could make a tangible difference in my productivity when building the training methods of my models.
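
For example, here is a rough sketch of the Hessian-based convexity check, using nested gradient tapes; `model` is assumed to be a class like the one above with scalar trainable variables, and `t_data`/`p_data` are placeholder names for the observations.

```python
import numpy as np
import tensorflow as tf

def loss_hessian(model, t, p):
    """Hessian of the MSE loss with respect to the model's scalar parameters."""
    params = model.trainable_variables  # e.g. [K, P0, r]
    with tf.GradientTape() as outer:
        with tf.GradientTape() as inner:
            loss = tf.reduce_mean(tf.square(model(t) - p))
        grads = inner.gradient(loss, params)
        flat = tf.stack(grads)             # gradient as a vector, shape [n]
    cols = outer.jacobian(flat, params)    # d(gradient)/d(param_i), each shape [n]
    return tf.stack(cols, axis=1).numpy()  # Hessian, shape [n, n]

# The loss is locally convex at the current parameter values when all
# eigenvalues of the (symmetric) Hessian are positive:
# eigvals = np.linalg.eigvalsh(loss_hessian(model, t_data, p_data))
# print("locally convex:", bool(np.all(eigvals > 0)))
```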

Question

Does there exist a guide for designing self-starter estimators of such parametric functions?

  • R has a concept of selfStart functions for this purpose, see for instance https://stackoverflow.com/questions/19013180/self-start-function-in-r and search that site for ideas! – kjetil b halvorsen Nov 25 '21 at 14:35
  • @kjetilbhalvorsen The link appears to use linear regression to approximate initial parameters. I am doing something similar with the Verhulst model: $y = mx + b$ is a fitted simple linear model, and I assign $r := \text{sign} (m) \ln |m|$. This approach is much better than initializing by sampling from a standard normal; however, I have found it often produces $r$ values that are larger in absolute value than desired. – Galen Nov 25 '21 at 15:01
  • If you look at other selfStart functions, you might find other approaches. You don't need a perfect method, only a value sufficiently good that the following iterations converge. – kjetil b halvorsen Nov 25 '21 at 15:04
  • @kjetilbhalvorsen I agree. A formalization of desired criteria might include that the estimate (1) sits within an order of magnitude (for a chosen base) of the optimal value and (2) sits within the same convex region as the optimal value. But really, an estimator that works well in practice matters more than adherence to these criteria. – Galen Nov 25 '21 at 15:09

1 Answer

There is a general method that works for a lot of these simple types of models, although for some reason most people seem unable or unwilling to see how easy they are. The self-starting approach for these models in R is wrong because the real problem is to find a stable parameterization of the model. Once this is done, finding good initial values is easy. Having fit the model in the stable parameterization, it is simple to calculate the values and standard deviations of the parameters of interest.

Finding Initial Parameter Values for Gompertz Model with Intercept

I worked through that example in some detail, so I will just explain the correct parameterization here. The horrible parameter you have is $K$. Of course, it is also a parameter of great interest.

Look at the data. Clearly the population sizes at the first and last times are, in general, much better determined than the carrying capacity $K$. So let $P_0$ and $P_n$ be the parameters for the population at the first and last time periods.

$$P_t=\frac{K}{1+\alpha\exp(-rt)}$$

I have simplified the equation by writing a single parameter $\alpha$ in place of $\left( \frac{K-P_0}{P_0} \right)$, so that this is a more general approach, more similar to the Gompertz example. Evaluating at the first and last times ($t=0$ and $t=t_n$) gives \begin{aligned} P_0&=\frac{K}{1+\alpha} \cr P_n&=\frac{K}{1+\alpha\exp(-rt_n)} \end{aligned} Now one can solve the first equation for $K$ in terms of $P_0$ and $\alpha$, $$ K=P_0(1+\alpha),$$ and substitute this into the second equation to get $$P_n=\frac{P_0(1+\alpha)}{1+\alpha\exp(-rt_n)},$$ which can be solved for $\alpha$ in terms of $P_0$, $P_n$, and $r$.

\begin{aligned} \alpha&=\frac{P_0-P_n}{P_n\exp(-rt_n)-P_0}\cr K&=P_0\big\{1+\frac{P_0-P_n}{P_n\exp(-rt_n)-P_0}\big\} \end{aligned}
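
In code, these formulas are a one-liner each; the following Python sketch (function and variable names are mine) recovers $\alpha$ and $K$ from the stable parameters.

```python
import numpy as np

def alpha_and_K(P0, Pn, r, tn):
    """Recover alpha and K from the endpoint populations P0 and Pn,
    the rate r, and the last observation time tn."""
    alpha = (P0 - Pn) / (Pn * np.exp(-r * tn) - P0)
    K = P0 * (1.0 + alpha)
    return alpha, K
```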

Now one can fix the parameters $P_0$ and $P_n$ and get an initial estimate for $r$, conditioned on the fixed values of $P_0$ and $P_n$, using the parameter estimator of choice. Then the three parameters $P_0$, $P_n$, and $r$ can be estimated simultaneously to get the complete solution; a sketch of this two-stage procedure follows. This method also works for the von Bertalanffy and the 4- or 5-parameter logistic models, although it is a bit simpler for those equations due to the linear nature of the parameters involved.
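
As an illustration, here is a sketch of the two-stage procedure using `scipy.optimize.curve_fit` as the estimator of choice; the starting guess for $r$ and all names here are my own assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic_stable(t, P0, Pn, r, tn):
    """Logistic curve in the stable (P0, Pn, r) parameterization."""
    alpha = (P0 - Pn) / (Pn * np.exp(-r * tn) - P0)
    K = P0 * (1.0 + alpha)
    return K / (1.0 + alpha * np.exp(-r * t))

def fit_two_stage(t, p):
    tn = t[-1]
    # Stage 1: fix P0 and Pn at the first and last observations; estimate r alone.
    P0_fix, Pn_fix = p[0], p[-1]
    (r_hat,), _ = curve_fit(
        lambda t, r: logistic_stable(t, P0_fix, Pn_fix, r, tn),
        t, p, p0=[0.1])  # arbitrary starting guess for r
    # Stage 2: estimate all three stable parameters simultaneously.
    popt, pcov = curve_fit(
        lambda t, P0, Pn, r: logistic_stable(t, P0, Pn, r, tn),
        t, p, p0=[P0_fix, Pn_fix, r_hat])
    return popt, pcov
```

Values and standard deviations for the parameters of interest, $\alpha$ and $K$, can then be obtained from `popt` and `pcov`, for instance via the delta method.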