
Suppose I have data in R that looks like this. The data represent measurements of different patients at discrete (yearly) time points:

df <- data.frame(patient_id = c(111,111,111, 111, 222, 222, 222), 
                 year = c(2010, 2011, 2012, 2013, 2011, 2012, 2013), 
                 gender = c("Male", "Male", "Male", "Male", "Female", "Female", "Female"), 
                 weight = c(98, 97, 102, 105, 87, 81, 83), 
                 state_at_year = c("healthy", "sick", "sicker", "sicker", "healthy", "sicker", "sicker"))

  patient_id year gender weight state_at_year
1        111 2010   Male     98       healthy
2        111 2011   Male     97          sick
3        111 2012   Male    102        sicker
4        111 2013   Male    105        sicker
5        222 2011 Female     87       healthy
6        222 2012 Female     81        sicker
7        222 2013 Female     83        sicker

I am interested in modelling the effect of different patient characteristics on how they transition between different states. To accomplish this, I am thinking of using Discrete Time Markov Cohort Models. Specifically, I am thinking of using the approach provided here (https://hesim-dev.github.io/hesim/articles/mlogit.html) in which:

  • All rows of data in which the patient starts in state k = 1 are isolated
  • Then, a Multinomial Logistic Regression is fit on this dataset (note: since we are modelling the probability of transition based on the information we know BEFORE the transition, the weight_end variable in the reformatted data below is never directly modelled for any given row)
  • This process is repeated for all other "k" states (excluding absorbing states, e.g. "death")
  • As a result, a series of Multinomial Logistic Regression Models is used to estimate the time-dependent transition probabilities between all states.
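As a rough sketch of the procedure above (not taken from the linked vignette), one multinomial logit could be fit per starting state with `nnet::multinom`. Everything here is an assumption for illustration: the data frame `sim` is randomly simulated with no real relationship between covariates and outcomes, and the column names simply mirror the transition format used in this question.

```r
library(nnet)  # provides multinom(); assumed to be installed

# Simulated transition data, for illustration only (not real patients)
set.seed(1)
n <- 300
states <- c("healthy", "sick", "sicker")
sim <- data.frame(
  state_start  = sample(states, n, replace = TRUE),
  gender_start = factor(sample(c("Male", "Female"), n, replace = TRUE),
                        levels = c("Female", "Male")),
  weight_start = round(rnorm(n, mean = 90, sd = 10))
)
sim$state_end <- factor(sample(states, n, replace = TRUE), levels = states)

# One multinomial logit per starting state, using only start-of-interval covariates
fits <- lapply(split(sim, sim$state_start), function(d) {
  multinom(state_end ~ gender_start + weight_start, data = d, trace = FALSE)
})

# Estimated transition probabilities out of "healthy" for a hypothetical patient
new_patient <- data.frame(
  gender_start = factor("Male", levels = c("Female", "Male")),
  weight_start = 98
)
p_healthy <- predict(fits[["healthy"]], newdata = new_patient, type = "probs")
```

Each element of `fits` corresponds to one row of the transition matrix, and `p_healthy` is the estimated row for the "healthy" state given that patient's covariates. An absorbing state would simply be left out of the `split()`, since no transitions leave it.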

To fit Discrete Time Markov Cohort Models (https://hesim-dev.github.io/hesim/articles/mlogit.html), I would have to reformat the data so that each row represents a transition between states:

 patient_id year_start year_end gender_start gender_end state_start state_end weight_start weight_end
1        111       2010     2011         Male       Male     healthy      sick           98         97
2        111       2011     2012         Male       Male        sick    sicker           97        102
3        111       2012     2013         Male       Male      sicker    sicker          102        105
4        222       2011     2012       Female     Female     healthy    sicker           87         81
5        222       2012     2013       Female     Female      sicker    sicker           81         83

structure(list(patient_id = c(111, 111, 111, 222, 222), year_start = c(2010, 2011, 2012, 2011, 2012), year_end = c(2011, 2012, 2013, 2012, 2013), gender_start = c("Male", "Male", "Male", "Female", "Female" ), gender_end = c("Male", "Male", "Male", "Female", "Female"), state_start = c("healthy", "sick", "sicker", "healthy", "sicker" ), state_end = c("sick", "sicker", "sicker", "sicker", "sicker" ), weight_start = c(98, 97, 102, 87, 81), weight_end = c(97, 102, 105, 81, 83)), row.names = c(1L, 2L, 3L, 4L, 5L), class = "data.frame")
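For reference, a minimal base R sketch of how this transition format could be produced from the original `df`. The helper name `to_transitions` is hypothetical, and the sketch assumes each patient has exactly one row per consecutive year (a gap in the years would silently produce a multi-year transition).

```r
# Example data from the question
df <- data.frame(
  patient_id = c(111, 111, 111, 111, 222, 222, 222),
  year = c(2010, 2011, 2012, 2013, 2011, 2012, 2013),
  gender = c("Male", "Male", "Male", "Male", "Female", "Female", "Female"),
  weight = c(98, 97, 102, 105, 87, 81, 83),
  state_at_year = c("healthy", "sick", "sicker", "sicker",
                    "healthy", "sicker", "sicker")
)

# Sort so that rows within a patient are in chronological order
df <- df[order(df$patient_id, df$year), ]

# Hypothetical helper: pair each row with the next row of the same patient
to_transitions <- function(d) {
  n <- nrow(d)
  if (n < 2) return(NULL)  # a single observation yields no transition
  start <- d[-n, , drop = FALSE]  # all rows except the last
  end   <- d[-1, , drop = FALSE]  # all rows except the first
  data.frame(
    patient_id   = start$patient_id,
    year_start   = start$year,
    year_end     = end$year,
    gender_start = start$gender,
    gender_end   = end$gender,
    state_start  = start$state_at_year,
    state_end    = end$state_at_year,
    weight_start = start$weight,
    weight_end   = end$weight
  )
}

# Apply per patient and stack the results
transitions <- do.call(rbind, lapply(split(df, df$patient_id), to_transitions))
rownames(transitions) <- NULL
```

A patient with n observations contributes n - 1 transitions, so the 7 rows here become 5. The state in each patient's final row still appears as `state_end` of their last transition.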

It appears that there is no way but to eliminate the last row of data for each patient, as that row has no following observation to complete a transition. This means that we will be forced to lose one row of data for each patient.

In cases where the patient experiences an absorbing event (e.g. death), this is not a problem. However, in cases where the patient is "right censored" (i.e. has the event after the end of the study), there is nothing we can do to account for censoring other than removing the last row of data for each patient. We could try to use some imputation method, or assume that the patient stays in the state they are currently in, but this is risky. As such, it seems like there is no option but to discard the last available row of data (i.e. the weight_end value occurring at the last row) for each patient and keep only the complete transitions for each patient.

Is my understanding of this correct?

stats_noob
  • It looks like you have a natural ordering of the states: healthy, sick, sicker, death. Think about whether that might be better served by an ordinal rather than a multinomial model. Also, it's not clear to me why you think that the last row is being discarded. Please edit the question to add details of the model you might be using on the data, and I think that you'll see there isn't a problem. Those last rows provide information to the model even if there is no transition during the time period, as they are evaluated along with other rows for which there is a transition. – EdM Apr 02 '23 at 08:37
  • @EdM: thank you so much for your reply! Two points that I wanted to add – stats_noob Apr 03 '23 at 01:04
  • To make my question easier to explain, I used "ordinal states" (e.g. sicker can be defined relative to sick) - but in reality, I don't actually have ordinal states – stats_noob Apr 03 '23 at 01:05
  • I was actually reading one of the answers you provided over here https://stats.stackexchange.com/questions/579564/censoring-in-a-discrete-time-hazard-model-of-time-to-drop-out-of-a-clinical-tria .... What I meant by "discarded" is that the weight_end values for the final rows of patients 111 and 222 cannot be used in the model. Thank you so much! – stats_noob Apr 03 '23 at 01:08
  • It looks like you have one record per year per subject. If the last record corresponds to the last observed record for that subject, then you don't need to do anything. The subsequent unobserved states are a kind of missing data problem, and how the imbalance between subjects is handled depends on the model specification. – AdamO Apr 03 '23 at 12:45
  • Whereas survival models are interpreted as models for incidence, logistic models (multinomial is a kind of logistic model) do not (directly) model incidence; they model states irrespective of time. "Censoring and truncation" need to be thought about differently in this situation - and I'm with Ed, the "worst" problem here is ignoring that these states are strongly ranked. In cardiovascular studies, subject status is often based on a rank like this, and the analysis will assign a numeric value to the various states (such as dead=20, sicker=4, sick=2, healthy=0) and use an LMM. – AdamO Apr 03 '23 at 12:50