0

I need to build an overall survival regression model that can show the survival plots for id's with multiple rows.

I have used the Weibull function to define the baseline hazard function h0(t) and Cox hazard model to evaluate the effect of covariates.

Then I evaluated the hazard function h(t), followed by cumulative hazard function H(t), and finally the survival function S(t).

Can somebody please let me know if the steps that I have followed is correct or not in Python?

Here is what I have done:

Step-1: Import libraries

#Load the required libraries
import pandas as pd
pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', 10)
pd.set_option('display.width', 1000)
import numpy as np
import matplotlib.pyplot as plt
from lifelines import WeibullFitter, CoxTimeVaryingFitter
from scipy import integrate

Step-2: Create dataset

The dataframe has 4 id's/machines each having multiple rows and columns as 'cycle', 'start','Covariate', and 'breakdown'.

id-1 has 8 cycles.

id-2 has 6 cycles.

id-3 has 10 cycles.

id-4 has 4 cycles.

At the end of each cycle the event/breakdown of that specific id/machine is 1

data = {'id': [1, 1, 1, 1, 1, 1, 1, 1,
           2, 2, 2, 2, 2, 2,
           3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
           4, 4, 4, 4],
    'cycle': [1, 2, 3, 4, 5, 6, 7, 8,
              1, 2, 3, 4, 5, 6,
              1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
              1,2,3,4],
    'start': [0, 1, 2, 3, 4, 5, 6, 7,
              0, 1, 2, 3, 4, 5,
              0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
              0, 1, 2, 3],
    'Covariate: x': [1.4,  1.9,  2.7,  3.8,  5.3,  7.4,  10.3,  14.4,
                  1.3,  1.6,  2.1,  2.7,  3.5,  4.2,
                  1.0,  1.1,  1.2,  1.3,  1.4,  1.5,   1.6,   1.8,  1.9,  2.1,
                  0.5,  1.0,  1.5,  2.0],
    'breakdown': [0, 0 ,0, 0, 0, 0, 0, 1,
                  0, 0, 0, 0, 0, 1,
                  0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
                  0, 0, 0, 1],
    }

#Convert to dataframe df = pd.DataFrame(data) print("df = \n", df)

The above dataframe looks as such:

enter image description here

Here, the plot of cycle vs covariate/feature for each id/machine until breakdown is as shows:

enter image description here

Step-3: Instantiate CoxTimeVaryingFitter() to calculate partial hazard ratios: HR = exp(x−x¯)'β

The partial hazard ratios is represented as:

enter image description here

The value of "exp(x−x¯)'β" is evaluated as such:

## Instantiate the Cox Time Varying Fitter
## Reference: https://lifelines.readthedocs.io/en/latest/fitters/regression/CoxTimeVaryingFitter.html
ctv = CoxTimeVaryingFitter(penalizer=0)

Fit the Cox Time Varying Fitter model

ctv.fit(df, id_col="id", event_col='breakdown', start_col='start', stop_col='cycle', show_progress=True)

Print the summary of the model

ctv.print_summary()

Plot the Regression coeff β

fig_verify = plt.figure(figsize=(15,7)) ctv.plot() plt.title("Regression coeff β") plt.show()

Evalauate the partial hazard ratios: HR = exp(x−x¯)'β

df_LogPartialHazard = ctv.predict_partial_hazard(df)

Append the partial hazard ratios (HR) to the dataframe

df['exp(x−x¯)′β'] = df_LogPartialHazard

The dataframe now looks as such:

enter image description here

Step-4: Instantiate WeibullFitter() to calculate the scale (λ) and shape parameters (ρ)

Here, we assume that the baseline hazard function for all the rows of a particular id is same.

The baseline hazard function is represented as:

enter image description here

The scale (λ) and shape parameters (ρ) are evaluated by considering the last row of every id when the machine breakdown by Instantiating the WeibullFitter().

## Get the last value of ['cycle','breakdown'] for every 'id' 
weibull_data = pd.DataFrame(df.groupby('id')[['cycle','breakdown']].max()).reset_index()

The above data for WeibullFitter() looks as such:

enter image description here

## Instantiate WeibullFitter class
## Reference: https://lifelines.readthedocs.io/en/latest/fitters/univariate/WeibullFitter.html
wbf = WeibullFitter()

##Fit the WeibullFitter to data wbf.fit(weibull_data['cycle'], weibull_data['breakdown'])

WeibullFitter summary

print("\n wbf.summary = \n",wbf.summary)

WeibullFitter model parameters

print("\n Scale parameter = λwbf = ",wbf.lambda) print("\n Shape parameter = ρwbf = ",wbf.rho)

AIC of Weibull model

print ("\n Akaike information criterion = ", wbf.AIC_)

Loglikelihood for weibull dist

wei = wbf.log_likelihood_ print("\n Loglikelihood = ", round(wei,2))

The shape and scale parameters are evaluated as:

enter image description here

Step-5: Create baseline hazard function [ho(t)] for all the rows in the dataset using the scale (λ) and shape parameters (ρ)

## Create baseline hazard function [ho(t)] for all the rows in the dataset
def baseline_hazard(t,ρ,λ):
    h_0 = (ρ/λ) * ((t/λ)**(ρ-1))
    return h_0

Baseline hazard function

df['ho(t)'] = baseline_hazard(t = df['cycle'] ,ρ = wbf.rho_, λ = wbf.lambda_)

The dataframe looks as such:

enter image description here

The baseline hazard function 'ho(t)' for all the id's looks as such:

enter image description here

Step-6: Create hazard function [h(t)] for all the rows in the dataset

The hazard function is given as such

enter image description here

## Hazard function: h(t) = ho(t)* exp(Xβ)
df['h(t)'] = df['exp(x−x¯)′β'] * df['ho(t)']
print("\n df = \n",df)

The dataframe looks as such:

enter image description here

The hazard function 'h(t)' for all the id's looks as such:

enter image description here

Step-7: Calculate the cumulative hazard function H(t)

The cumulative hazard function H(t) is represented as:

enter image description here

Here we apply the simpson integration rule to evaluate the H(t) for every id

## Create function to evaluate cumulative hazard finction H(t)
def CUMULATIVE_HAZARD(complete_dataframe):
## Get the total number of cycles for each id 
grouped_by_unit = complete_dataframe.groupby(by="id")

## Create empty array for all the unique number of id's
Cum_Haz_func_id = np.empty((np.int64(np.shape(complete_dataframe['id'].unique())[0]),0))   

## Iterate over unique umber of id's
for j in range (0,len(complete_dataframe['id'].unique())) :   

    ## Rows of a particular id
    df_each_id = grouped_by_unit.get_group(j+1).reset_index()

    ## Empty array for rows of a particular id
    Cum_Haz_func_j_id = np.empty((np.int64(np.shape(df_each_id.index)[0]),0))               

    ## Iterate over all the rows in a particular id
    for i in range (0,len(df_each_id.index)) :
        x = df_each_id['h(t)'][0:df_each_id.index[i]+1].reset_index()
        X = x['index']
        Y = x['h(t)']

        ## Integration via simpson rule
        Cum_Haz_func_in_j_ID_i_DataPoint = integrate.simps(Y, X)

        ## Append for the rows of a specific id
        Cum_Haz_func_j_id = np.append( Cum_Haz_func_j_id, Cum_Haz_func_in_j_ID_i_DataPoint)

    ## Append for all the rows for all the id's
    Cum_Haz_func_id = np.append( Cum_Haz_func_id, Cum_Haz_func_j_id)

return Cum_Haz_func_id    


Cumulative Hazard function

df['H(t)'] = CUMULATIVE_HAZARD(df)

The dataframe looks as such:

enter image description here

And the Cumulative Hazard function 'H(t)' for each id looks as such:

enter image description here

Step-8: Calculate the survival function S(t)

The survival function is written as:

enter image description here

## Survival function
df['S(t)'] = np.exp(-df['H(t)'])
print(" \n df = \n", df)

The above dataframe looks as such:

enter image description here

The survival function of each id looks as:

enter image description here

Now, can somebody please let me know if the above process to obtain the survival function plot for each id is correct or not?

Edit-1

The cumulative baseline hazard can be extracted from CoxTimeVaryingFitter() as such:

## Cumulative Baseline Hazard: Ho(t)
CTV_Baselines = pd.concat([ctv.baseline_cumulative_hazard_,    ## Baseline Cumulative Hazard function: Ho(t)
                           ], axis=1).reset_index()

CTV_Baselines.columns = ['Last_cycle', 'Ho', ]

print(" \n CTV_Baselines = \n", CTV_Baselines)

The cumulative baseline hazard looks as such:

enter image description here

1 Answers1

1

As I understand your code, there’s a problem with your baseline hazard estimate. It should be the same for all individuals and represent the situation when all covariates are at their mean values. Instead you seem to estimate separate baseline hazards for each individual based on the covariate value at failure.

You could instead extract the cumulative baseline hazard from your initial Cox model and find the best-fitting Weibull parameters to match that.

EdM
  • 92,183
  • 10
  • 92
  • 267
  • Thanks a lot @EdM for your valuable suggestions. I have extracted the cumulative baseline hazard as mentioned in Edit-1. Accordingly, I have tried to find the best-fitting Weibull parameters in Python which has been mentioned here: https://stackoverflow.com/questions/75602968/calculation-of-shape-parameter-and-scale-parameter-from-vaues-of-cumulative-haza Could you please give your views on whether my understanding is correct or not. Also, could you please let me know the general procedure of how to extract the cumulative baseline hazard? – NN_Developer Mar 01 '23 at 11:31
  • 1
    @NN_Developer this page outlines how the cumulative baseline hazard is extracted after a Cox model has been fit. Depending on the reference, the method is given a name including "Nelson," "Aalen," or "Breslow," for those who helped develop this counting-process analysis of cumulative hazards or extended it to Cox models. I don't use Python enough to comment on your curve fitting in the Stack Overflow page that you link. Be careful in Weibull fitting, as there are different parameterizations. – EdM Mar 01 '23 at 14:50
  • Thanks a lot, @EdM for your feedback. – NN_Developer Mar 02 '23 at 04:54