
The setup I describe below is analogous to my actual problem.

Problem: I have millions of individuals in my dataset, and for each individual I have certain stats over time. For example, for one individual, Joe, I have monthly measurements about Joe that change through time (age, BMI, and other medical stats), and I have data on certain economic conditions (income of his area, proximity to the nearest hospital, etc.) that also change with time. Additionally, my training data records whether or not the individual survives each month.

A third party selects groups of 1000 individuals; the dataset contains past, already-defined groupings. My goal: given a new group, predict the proportion of the group that survives in the future (for many months out).

Approach 1 (Logistic Regression/Trees): Because each individual outcome is independent, we could predict the probability of death for each individual and aggregate up to the group level to see what proportion of the group survives.
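A minimal sketch of the aggregation step, assuming some fitted model has already produced per-individual monthly survival probabilities (the numbers below are made up for illustration):

```python
import numpy as np

def expected_surviving_proportion(p_survive):
    """Given each group member's survival probability for a month,
    return the expected proportion of the group that survives.

    Under independence, the number of survivors is a sum of
    independent Bernoulli draws, so its expectation is just the
    mean of the individual probabilities.
    """
    p = np.asarray(p_survive, dtype=float)
    return p.mean()

# Hypothetical group of four individuals
probs = [0.99, 0.95, 0.90, 0.80]
print(expected_surviving_proportion(probs))  # 0.91
```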

Issues: While I know the economic and health conditions at each time period, I do not know the future values of these factors/features. This limits how far into the future I can make predictions using regressions.

Approach 2 (Time Series + Regression): Another option is to make the prediction at the group level and engineer new features, such as averaging age and BMI across individuals to get the average BMI for the group, and computing the number of individuals remaining in the group at each month. Doing this lets me consider time series techniques: the survival proportion for a particular group is itself a time series, and the engineered features can enter as regressors.
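One way to engineer such group-level features, assuming a long-format table with one row per individual-month (the column names here are hypothetical, not my actual schema):

```python
import pandas as pd

# Hypothetical long-format data: one row per individual per month
df = pd.DataFrame({
    "group_id": [1, 1, 1, 1],
    "month":    [0, 0, 1, 1],
    "bmi":      [22.0, 30.0, 23.0, 31.0],
    "survived": [1, 1, 1, 0],
})

# Collapse individuals into one row per (group, month)
group_features = (
    df.groupby(["group_id", "month"])
      .agg(mean_bmi=("bmi", "mean"),
           sd_bmi=("bmi", "std"),
           n_alive=("survived", "sum"),
           surv_prop=("survived", "mean"))
      .reset_index()
)
print(group_features)
```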

Issues: There is no longer a survival probability for each individual, which would be helpful for interpretability. The distribution of BMIs is condensed into a mean/standard deviation, which comes with information loss.

I was mainly wondering if anyone has encountered a problem similar to this before and what techniques would be helpful. Is it more naturally a time series problem or a regression problem?

Froozle
  • You may be interested in our [tag:survival] tag. Survival analysis is a special category of problem for which we have dedicated methods. – mkt Sep 07 '22 at 06:26
  • Thank you! I will take a look at those questions and have added the tag to this one. – Froozle Sep 07 '22 at 12:33

1 Answer


There's a problem with aggregating data (Approach 2) in this context: the survival estimate for an "average" set of covariate values isn't necessarily the average of survival estimates among members of the group. The relationships tend to be too non-linear to rely on that.
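A quick numeric illustration of that non-linearity, using a made-up one-covariate logistic survival model as a stand-in:

```python
import numpy as np

def surv(x):
    # Hypothetical model: survival probability as a steep logistic
    # function of a single risk score x.
    return 1.0 / (1.0 + np.exp(3.0 * (x - 1.0)))

x = np.array([0.0, 1.5])        # two group members' risk scores
avg_of_surv = surv(x).mean()    # average of the individual survivals
surv_of_avg = surv(x.mean())    # survival at the "average" member

# The two quantities disagree noticeably: aggregating covariates
# first, then predicting, is not the same as predicting first,
# then aggregating.
print(avg_of_surv, surv_of_avg)
```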

Something related to Approach 1 makes sense. With a discrete set of time points, discrete-time survival analysis is essentially binomial regression over time, with the probability of an event during each time period modeled as a function of covariate values in place during that period. For the baseline survival over time at reference covariate values, you can model separately for each time point or impose a semi-parametric or parametric form. There are other methods for continuous-time survival, if that's more appropriate for your data.
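A sketch of how the data get restructured for such a discrete-time model: each individual is expanded into "person-period" rows, one per month at risk, and a binomial/logistic regression is then fit to the per-month event indicator. Names and structure here are illustrative, not a prescribed schema:

```python
def person_period_rows(individual_id, months_observed, died):
    """Expand one individual's record into person-period rows:
    one row per month at risk, with event = 1 only in the final
    month if the individual died then.
    """
    rows = []
    for t in range(1, months_observed + 1):
        event = 1 if (died and t == months_observed) else 0
        rows.append({"id": individual_id, "month": t, "event": event})
    return rows

# Hypothetical example: one person observed 3 months, dying in month 3
rows = person_period_rows("joe", 3, died=True)
print(rows)
# In practice each row would also carry the covariate values in
# effect during that month, as described above.
```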

That doesn't address your concern about extrapolation, which of course is a problem even with standard time-series methods. You probably want to specify a parametric baseline survival form or a smooth semi-parametric form. For example, modeling time with restricted cubic splines enforces linearity beyond the outermost "knots." You can model the trajectories of covariates over time and extrapolate them.
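To make the spline point concrete, here is a minimal restricted-cubic-spline basis in the truncated-power form (knot values are arbitrary for illustration); evaluated past the last knot, every basis column is exactly linear, so extrapolation can't blow up cubically:

```python
import numpy as np

def rcs_basis(x, knots):
    """Restricted (natural) cubic spline basis, truncated-power form.
    The cubic terms are combined so that the curve is linear beyond
    the outermost knots.
    """
    x = np.asarray(x, dtype=float)
    k = np.asarray(knots, dtype=float)

    def pp3(u):                      # positive part cubed: (u)_+^3
        return np.where(u > 0, u, 0.0) ** 3

    cols = [x]                       # the plain linear term
    t_last, t_prev = k[-1], k[-2]
    for tj in k[:-2]:
        cols.append(
            pp3(x - tj)
            - pp3(x - t_prev) * (t_last - tj) / (t_last - t_prev)
            + pp3(x - t_last) * (t_prev - tj) / (t_last - t_prev)
        )
    return np.column_stack(cols)

# Evaluate well beyond the last knot (6.0): second differences on an
# evenly spaced grid vanish, i.e. the basis is linear out there.
x = np.linspace(10.0, 20.0, 50)
B = rcs_basis(x, knots=[1.0, 2.0, 4.0, 6.0])
print(np.allclose(np.diff(B, n=2, axis=0), 0.0))  # True
```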

There is a fair amount of work on joint modeling of covariates and events, which might help you. I don't know much about it. The last paragraph of this answer has links that might be useful.

EdM