
I am trying to fit a proportional hazards regression model to case-cohort data (where cases are oversampled and not representative of the population). I am using the cch() function from the survival package to fit a Prentice-weighted Cox proportional hazards regression.

fit <- cch(Surv(followuptime, event) ~ A1 + A2 + A3, data = datadf, stratum = NULL, subcoh = ~subcoh, id = ~id, cohort.size = 4512, method = "Prentice")

A1, A2 and A3 are metabolites, and I have 812 of them. I would like to train the model on a training subset of the data and use the predict function to get a "score" for each subject in the validation set. I then want to split the subjects according to whether the score is higher or lower than the median and plot the Kaplan-Meier curves. However, I am unable to use a predict function on a cch object. Is there an alternative way to do that?

I have also looked at using the coxphw package, which implements Prentice weights and has a predict function. However, I am unclear on how it works in the context of case-cohort data, given that there is no argument to indicate subcohort membership or the cohort size.

fit <- coxphw(Surv(followuptime, event) ~ A1 + A2 + A3, data = datadf, template = "AHR")

I would appreciate any advice on this. I was also wondering how one would approach cross-validating such a model. I have read a lot of the literature and I am unclear on what to do next, as I am struggling to find methods/functions/packages that fit what I need.

Information on dataset: My dataset is from a case-cohort study. Cases (n=98 cases in total) and a random sub-cohort (n=325) were included in the case-cohort study. There were 12 cases in the sub-cohort of 325. As such, the final dataset included 301 controls, and 98 cases (total n=399).

Thank you so much!

1 Answer


Predictions from case-cohort survival studies

The coxphw package probably doesn't fit your application; the weighting there is designed to take censoring probabilities into account so that you can get reasonable estimates of average hazard ratios even when the proportional hazards assumption doesn't hold. It's not designed specifically to handle sampling weights.

Better tools for your application are available in the R survey package, which handles multiple sampling approaches and types of statistical analysis. That package requires a specific form for specifying the sampling design. Case-cohort survival studies are illustrated in the package's two-phase design vignette. The package's svycoxph() function, unlike the cch() function, returns an object that inherits from coxph objects and thus can be used to generate predictions.
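As a rough sketch of that approach, the following adapts the case-cohort example from the survey package's two-phase vignette, using the nwtco data that ships with the survival package rather than your metabolite data. It assumes a data frame with one row per member of the full cohort plus a subcohort indicator; if you only have the sampled subjects, the design would need to be specified differently (for example with explicit sampling weights). The names dcch, sampled, score and high are just illustrative.

```r
library(survival)  # Surv() and the nwtco example data
library(survey)    # twophase() and svycoxph()

# Phase 1 is the whole cohort; phase 2 keeps the subcohort plus all cases
# (rel == 1), sampled within strata defined by case status.
dcch <- twophase(id = list(~seqno, ~seqno),
                 strata = list(NULL, ~rel),
                 subset = ~I(in.subcohort | rel),
                 data = nwtco)

# Design-weighted Cox model; the result inherits from "coxph".
fit <- svycoxph(Surv(edrel, rel) ~ factor(stage) + factor(histol) + I(age / 12),
                design = dcch)

# Linear-predictor "score" for a set of subjects (here, the sampled ones;
# in practice this would be whatever data you want predictions for).
sampled <- subset(nwtco, in.subcohort | rel == 1)
score <- predict(fit, newdata = sampled, type = "lp")

# Split at the median score, as described in the question.
high <- score > median(score)
table(high)
```

Any Kaplan-Meier curves drawn from such a split would still need to account for the case-cohort sampling, since the sampled subjects over-represent cases.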

Training/test data split

Don't do it with such a small data set. See this web page for details on why. Fundamentally, your training set will give you less precise estimates of the coefficients, and the test set will be too small to give robust estimates of model performance. The best approach is to build your model with all your data and then evaluate the performance of the modeling approach via resampling. For example, you can repeat the modeling on multiple bootstrap samples of the data, then evaluate the performance of all those models on the full data set. See the Resampling section below, however.
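To make the bootstrap idea concrete, here is a minimal sketch on an ordinary cohort data set (lung, from the survival package). It deliberately ignores case-cohort sampling, so it illustrates only the general procedure, not something validated for your design. The choice of predictors, the number of resamples B, and the use of the concordance index as the performance measure are all assumptions made for illustration.

```r
library(survival)

set.seed(42)
# Ordinary (fully sampled) cohort; NOT case-cohort data.
dat  <- na.omit(lung[, c("time", "status", "age", "ph.karno")])
form <- Surv(time, status) ~ age + ph.karno

# Apparent performance: model built and evaluated on the same data.
apparent <- concordance(coxph(form, data = dat))$concordance

B <- 50  # number of bootstrap resamples (small, to keep the sketch fast)
optimism <- replicate(B, {
  boot  <- dat[sample(nrow(dat), replace = TRUE), ]
  fit_b <- coxph(form, data = boot)
  c_boot <- concordance(fit_b)$concordance                 # on the bootstrap sample
  c_full <- concordance(fit_b, newdata = dat)$concordance  # on the full data
  c_boot - c_full
})

# Optimism-corrected estimate of the concordance index.
corrected <- apparent - mean(optimism)
c(apparent = apparent, corrected = corrected)
```

In a real analysis, every data-driven modeling step (variable selection, dimension reduction, tuning) would have to be repeated inside each bootstrap sample.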

812 metabolites

Your 98 events set some limits on what you can do here. Without some type of penalization, you should evaluate only about 5 to 10 predictors to avoid overfitting, based on a rule of thumb of 10 to 20 events per predictor in survival studies.
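The arithmetic behind that range is simply the number of events divided by the events-per-predictor rule of thumb (the variable names are illustrative):

```r
n_events <- 98
events_per_predictor <- c(20, 10)  # stricter and looser ends of the rule of thumb

# Approximate number of unpenalized predictors the data can support.
max_predictors <- n_events / events_per_predictor
max_predictors  # 4.9 9.8
```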

If A1, A2 and A3 already represent some substantial dimension reduction from your 812 metabolites, for example the first 3 principal components of the metabolite values, then you should be OK. You would be wise to include outcome-associated clinical variables in addition to the metabolite data, to document that the metabolite data add something useful to what can already be determined clinically. Note that Cox models have omitted-variable bias even if the omitted outcome-associated predictors are uncorrelated with the included predictors. Thus including clinical covariates could actually improve your ability to find outcome-associated metabolite information.
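As an illustration of that kind of dimension reduction, here is a minimal sketch using principal components on simulated values. The matrix metab is random noise standing in for real metabolite measurements, and the dimensions (399 subjects, 812 metabolites) are taken from the question; the column names are hypothetical.

```r
set.seed(1)
n_subjects    <- 399
n_metabolites <- 812

# Simulated stand-in for the metabolite matrix (rows = subjects).
metab <- matrix(rnorm(n_subjects * n_metabolites), nrow = n_subjects,
                dimnames = list(NULL, paste0("M", seq_len(n_metabolites))))

# PCA on centered, scaled metabolites; the outcome is not used here.
pca <- prcomp(metab, center = TRUE, scale. = TRUE)

# Keep the first 3 component scores as candidate predictors.
pcs <- pca$x[, 1:3]
colnames(pcs) <- c("A1", "A2", "A3")
dim(pcs)
```

Because the components are computed without reference to the outcome, this step does not use up the events-per-predictor budget the way outcome-driven selection of metabolites would.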

If you want to build such models with all 812 of your metabolites, however, you will face substantial problems with overfitting and multiple comparisons.

Resampling

I don't know enough about resampling (bootstrap or cross validation) in case-cohort designs to provide a useful answer to that part of your question. I'm not familiar enough with case-cohort sampling to know whether the bootstrap methods in the survey package are applicable to your study. I found a paper by Y. Huang on "Bootstrap for the case-cohort design," Biometrika, June 2014, Vol. 101, No. 2, pp. 465-476 that might help. If that doesn't help or you don't get another answer to that part here, you might consider asking a separate question specific to resampling in case-cohort studies.

EdM