I'm interested in building survival analysis from scratch (ideally in Python) to better understand how it works. I haven't found much material, so I was wondering whether anyone can share material available online on this topic.
1 Answer
First, make sure that you understand the basics of a survival function. In a case with one event per individual it's just the complement of the cumulative distribution of event times, but that simplicity can get lost when you get deep into details. You should start with a simple situation with only one event (at most) per individual and all individuals having either known event times or right censoring of event times.
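With no censoring at all, "complement of the cumulative distribution" can be coded directly: the estimated survival at $t$ is the fraction of individuals whose event time exceeds $t$. A minimal sketch, with a function name and toy data made up purely for illustration:

```python
def empirical_survival(event_times, t):
    """Estimate S(t) = P(T > t) as 1 minus the empirical CDF.

    Only valid when every event time is fully observed (no censoring).
    """
    n = len(event_times)
    return sum(1 for x in event_times if x > t) / n


# Hypothetical event times for five individuals.
toy_times = [2, 3, 3, 5, 8]
print(empirical_survival(toy_times, 3))  # 2 of 5 times exceed 3 -> 0.4
```

Once some times are censored this simple fraction no longer works, which is the problem the Kaplan-Meier estimator addresses.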
To get an introduction to censoring, try coding the Kaplan-Meier estimator of survival in a single cohort. Quoting from the linked Wikipedia page:
The estimator of the survival function $S(t)$ (the probability that life is longer than $t$) is given by:
$$ \widehat S(t) = \prod\limits_{i:\ t_i\le t} \left(1 -\frac{d_i}{n_i}\right),$$
with $t_i$ a time when at least one event happened, $d_i$ the number of events (e.g., deaths) that happened at time $t_i$, and $n_i$ the individuals known to have survived (have not yet had an event or been censored) up to time $t_i$.
That should be fairly easy to code if you start with data sorted by observation (event or right-censoring) times. The Wikipedia page shows how this is related to a maximum-likelihood estimate, and introduces estimates of variance for your further coding pleasure.
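As a starting point, here is a minimal plain-Python sketch of the product formula quoted above. The function name `kaplan_meier`, the 0/1 encoding of `events`, and the toy data are my own choices for illustration. It assumes the usual convention that an individual censored at time $t_i$ still counts as at risk at $t_i$.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t_i, S_hat(t_i))] for each distinct time with an event.

    times:  observation time per individual (event or right-censoring)
    events: 1 if the event was observed at that time, 0 if right-censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e)  # d_i
    leaving = Counter(times)  # events plus censorings at each time
    n_at_risk = len(times)    # n_i, updated as we walk through time
    surv = 1.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d > 0:
            surv *= 1.0 - d / n_at_risk
            curve.append((t, surv))
        # Everyone observed at t (event or censored) leaves the risk set.
        n_at_risk -= leaving[t]
    return curve


# Hypothetical cohort: six individuals, two of them right-censored.
toy_times = [1, 2, 2, 3, 4, 5]
toy_events = [1, 1, 0, 1, 0, 1]
for t, s in kaplan_meier(toy_times, toy_events):
    print(t, round(s, 4))
```

Note that censored individuals never multiply a factor into the product; they only shrink the risk set for later event times.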
To move beyond that and incorporate covariate values into survival estimates, the simplest model to try to code yourself might be a Cox proportional hazards model with covariate values fixed in time and no duplicate event times. Unlike parametric survival models, it is fit by maximizing a "partial likelihood" that ignores contributions of the baseline survival function to the full likelihood. It's based on calculations only at known event times, and handles censoring by allowing individuals with right-censored event times to contribute to the partial likelihood during time periods when they are at risk of an event.
The Wikipedia page describes the form of the partial likelihood and outlines how to maximize it. That should help illustrate the concept of hazard in survival modeling.
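For the restricted case described above (a single time-fixed covariate $x$, no tied event times), the log partial likelihood is

$$ \ell(\beta) = \sum_{i:\ \text{event}} \left[ \beta x_i - \log \sum_{j:\ t_j \ge t_i} e^{\beta x_j} \right], $$

and it can be maximized with Newton-Raphson. The following is only a sketch under those assumptions. The function names and toy data are hypothetical, and real implementations add handling for ties, multiple covariates, and convergence safeguards.

```python
import math


def risk_sums(beta, times, x, t_i):
    """Weighted sums over the risk set at t_i (everyone with time >= t_i)."""
    s0 = s1 = s2 = 0.0
    for t_j, x_j in zip(times, x):
        if t_j >= t_i:
            w = math.exp(beta * x_j)
            s0 += w
            s1 += w * x_j
            s2 += w * x_j * x_j
    return s0, s1, s2


def log_partial_likelihood(beta, times, events, x):
    ll = 0.0
    for t_i, e_i, x_i in zip(times, events, x):
        if e_i:  # censored individuals appear only inside risk sets
            s0, _, _ = risk_sums(beta, times, x, t_i)
            ll += beta * x_i - math.log(s0)
    return ll


def fit_cox_1d(times, events, x, max_iter=50, tol=1e-10):
    """Newton-Raphson for one coefficient; assumes no tied event times."""
    beta = 0.0
    for _ in range(max_iter):
        grad = hess = 0.0
        for t_i, e_i, x_i in zip(times, events, x):
            if e_i:
                s0, s1, s2 = risk_sums(beta, times, x, t_i)
                mean = s1 / s0                 # risk-set weighted mean of x
                grad += x_i - mean
                hess -= s2 / s0 - mean * mean  # minus the weighted variance
        if hess == 0.0:
            break
        step = grad / hess
        beta -= step
        if abs(step) < tol:
            break
    return beta


# Hypothetical data: x = 1 tends to go with earlier events.
toy_times = [1, 2, 3, 4, 5, 6, 7, 8]
toy_events = [1, 1, 1, 0, 1, 1, 0, 1]
toy_x = [1, 0, 1, 1, 0, 1, 0, 0]
beta_hat = fit_cox_1d(toy_times, toy_events, toy_x)
print(beta_hat, math.exp(beta_hat))  # coefficient and hazard ratio
```

Each term in the gradient compares the covariate of the individual who had the event with the weighted average covariate among those still at risk, which is a concrete way to see what "hazard" means in this model.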
If you want to extend this to fully parametric models and different types of censoring or truncation of event times, this page and its links show the general forms that you can use to get the likelihood of a data set for a parametric survival function.
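For right censoring only, the general form is that an observed event at time $t$ contributes the density $f(t)$ to the likelihood, while a right-censored observation contributes $S(t)$. A minimal sketch with an exponential model, which I chose here only because its maximum-likelihood estimate has a closed form (number of events divided by total observed time); the function names and data are made up for illustration:

```python
import math


def exp_loglik(rate, times, events):
    """Log-likelihood of an exponential model with right censoring."""
    ll = 0.0
    for t, e in zip(times, events):
        if e:
            ll += math.log(rate) - rate * t  # log f(t) for an observed event
        else:
            ll += -rate * t                  # log S(t) for a censored time
    return ll


def exp_mle(times, events):
    """Closed-form MLE of the rate: events / total time at risk."""
    return sum(events) / sum(times)


# Hypothetical cohort: six individuals, two of them right-censored.
toy_times = [1, 2, 2, 3, 4, 5]
toy_events = [1, 1, 0, 1, 0, 1]
rate_hat = exp_mle(toy_times, toy_events)
print(rate_hat, exp_loglik(rate_hat, toy_times, toy_events))
```

For distributions without a closed-form estimate, the same log-likelihood function can be handed to a numerical optimizer instead.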
The lifelines Python package has a reasonable introduction to the concepts, as do the several vignettes on different types of models and issues in the R survival package. – EdM Dec 10 '22 at 19:21