I'm planning a Kaplan-Meier analysis and need to determine an appropriate sample size for the study. I have two treatment arms and want to show that there is a differential biomarker effect depending on the treatment - I expect that the biomarker should be predictive of outcome for one treatment, but have no outcome association in the other.
Calculating the required sample size for each arm individually is fairly straightforward - using a tool like the powerSurvEpi package in R, or the calculator here, I'm able to plug in the desired alpha and beta, as well as the expected hazard ratio (HR) and biomarker prevalence, to get a sample size for each arm. If all goes well, this should give me a good chance of rejecting the null hypothesis that HR=1 in the biomarker-predictive arm, and should also give enough power that failing to reject the null in the biomarker-agnostic arm is actually meaningful. These analyses will show whether the observed arm-specific biomarker HRs are significantly different from a fixed value of HR=1.
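To make the per-arm calculation concrete, here is a minimal sketch using Schoenfeld's approximation for a binary covariate in a Cox model. It is not necessarily what powerSurvEpi or any particular calculator does internally. The function name `events_single_arm` and the example inputs (HR of 0.5, 30% biomarker prevalence) are placeholders chosen for illustration. Note that it returns the required number of events, not patients; the patient count would be the event count divided by the expected probability of observing an event.

```python
import math
from statistics import NormalDist


def events_single_arm(hr, prevalence, alpha=0.05, power=0.80):
    """Approximate events needed in ONE arm to detect a biomarker HR.

    Schoenfeld approximation for a two-sided test of HR = 1:
        d = (z_{1-alpha/2} + z_{power})^2 / (p * (1 - p) * log(HR)^2)
    where p is the biomarker prevalence (hypothetical inputs below).
    """
    z = NormalDist().inv_cdf  # standard normal quantile function
    z_total = z(1 - alpha / 2) + z(power)
    denom = prevalence * (1 - prevalence) * math.log(hr) ** 2
    return math.ceil(z_total ** 2 / denom)


# Illustrative numbers only: HR = 0.5, 30% of patients biomarker-positive.
d_arm = events_single_arm(hr=0.5, prevalence=0.3)
print("events needed in the biomarker-predictive arm:", d_arm)
```

The required events grow quickly as the HR approaches 1 or as the prevalence moves away from 50%.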
The question is, what sample size do I need to show that the effects between arms are different from each other? Finding a significant result in the biomarker-predictive arm indicates that the confidence interval (CI) of the HR does not cross 1, while a non-significant result in the biomarker-agnostic arm indicates that the CI of the HR does cross 1. These two CIs may overlap quite a bit, though, so I don't think the sample size for each arm individually is sufficient to show that the arms actually behave differently from each other.
Is there a way to calculate the sample size needed to show that a hazard ratio is different from another, uncertain hazard ratio, rather than being different from a fixed value? This approach would seek to directly compare the HRs between both arms. Another idea would be to perform comparisons within the biomarker-positive or biomarker-negative groups rather than treatment arms, showing that biomarker-positive samples perform differently depending on treatment but that biomarker-negative samples do not. Is either approach preferable, and what's the required sample size?
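One way to frame the first approach is as a test of the treatment-by-biomarker interaction, i.e. whether log(HR_A) - log(HR_B) differs from 0. The sketch below is a rough approximation under assumptions that are mine and would need checking against the real design: each arm's log HR estimate is approximately normal with variance 1/(p(1-p)d), the two arms are independent, and both arms share the same biomarker prevalence p and number of events d. Under those assumptions the variance of the difference is the sum of the two arm variances. The function name and the example inputs are hypothetical.

```python
import math
from statistics import NormalDist


def events_per_arm_interaction(hr_a, hr_b, prevalence, alpha=0.05, power=0.80):
    """Approximate events needed PER ARM to detect HR_A != HR_B.

    Assumes equal events and equal biomarker prevalence in both arms, so
        Var(log HR_A - log HR_B) = 2 / (p * (1 - p) * d)
    and therefore
        d = 2 * (z_{1-alpha/2} + z_{power})^2 / (p * (1 - p) * log(HR_A / HR_B)^2)
    """
    log_ratio = math.log(hr_a / hr_b)
    if log_ratio == 0:
        raise ValueError("HR_A equals HR_B: there is no interaction to detect")
    z = NormalDist().inv_cdf  # standard normal quantile function
    z_total = z(1 - alpha / 2) + z(power)
    denom = prevalence * (1 - prevalence) * log_ratio ** 2
    return math.ceil(2 * z_total ** 2 / denom)


# Illustrative numbers only: HR = 0.5 in one arm, HR = 1.0 in the other.
d = events_per_arm_interaction(hr_a=0.5, hr_b=1.0, prevalence=0.3)
print("events per arm:", d, "| total events:", 2 * d)
```

With the same inputs, this asks for roughly twice as many events per arm as a test of one arm's HR against the fixed value 1, because the comparison HR is now estimated rather than known.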