I am using a logistic regression model to predict breast cancer. I trained and tested the model in a population with a fairly high incidence of breast cancer (about 30% of the people have cancer), since all of the individuals went to the hospital and got tested. But then it hit me: what if I use this model in a population where the incidence of breast cancer is low, one that includes people who underwent screening but turned out to be healthy (they were excluded from this research based on some criteria)? Will the accuracy of the model decrease? Will it make my results look better if I increase the threshold of the logistic model? I'm trying to defend myself on this issue and I really need some ideas and help.

To sum up, my question is whether the model accuracy has anything to do with the incidence of cancer (whether there is any evidence to defend that the model accuracy will remain decent in a population with a low incidence of disease), and is there any way to make my result look better?

  • So you use logistic regression to estimate the probability of cancer, P{y|x}, then you apply a threshold to convert the probability into a decision: has cancer or not? And you've chosen the threshold on the hospital dataset? – dipetkov Apr 14 '22 at 11:59
  • Also you might be overthinking the practical aspect to some extent: How are you going to apply your model to people who haven't gone to the hospital for a checkup or a concern that they might have cancer? – dipetkov Apr 14 '22 at 12:01
  • I fitted a logistic model and created a table of metrics (e.g. sensitivity, specificity, accuracy) to show my results. Someone questioned me about the incidence of disease because I fitted the model on a dataset where the incidence is about 30%. I'm not sure how to defend myself in this case and I am thinking about various ways to present my results better. – user10386405 Apr 14 '22 at 12:10
  • Also, I will edit my post. I meant the healthy people who underwent screening. They did go to the hospital but turned out to be very healthy. – user10386405 Apr 14 '22 at 12:12
  • It would help if you said what threshold you use to make the hard classification based on the probability outputs of the logistic regression. – Dave Apr 14 '22 at 13:00
  • I used a 0.5 threshold – user10386405 Apr 15 '22 at 01:50
  • Why did you use a threshold of $0.5?$ (“I don’t know,” is one possible answer.) – Dave Apr 15 '22 at 02:13
  • It just feels right... I guess my answer is "I don't know". Actually, I also listed the results for thresholds of 0.1, 0.2, 0.3, and 0.4 as a supplement. Someone said I could consider increasing the threshold and presenting that result, and I don't know why. – user10386405 Apr 15 '22 at 11:50

1 Answer

That criticism of your approach is valid. Think of it this way: if the model thinks that breast cancer is fairly common, it will not be so skeptical about someone having breast cancer.

We can think in terms of Bayes' theorem, where the logistic regression models $P(Y=1\vert X=x)$.

$$ P(Y=1\vert X=x) = \dfrac{P(X=x\vert Y=1)P(Y=1)}{P(X=x)} $$

As you change the disease incidence (prior probability), $P(Y=1)$, you change the posterior probability, $P(Y=1\vert X=x)$.
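To make the effect concrete, here is a minimal sketch of how a predicted probability could be re-scaled for a population with a different incidence. It rests on an assumption that is not guaranteed in practice: $P(X=x\vert Y)$ is the same in both populations, so only the prior $P(Y=1)$ changes. Under that assumption, Bayes' theorem in odds form says the posterior odds are multiplied by the ratio of the new prior odds to the old prior odds. The function name `adjust_for_prevalence` and the 1% target incidence are hypothetical choices for illustration; the 30% figure is the one from the question.

```python
def adjust_for_prevalence(p_model, train_prevalence, target_prevalence):
    """Re-scale a predicted probability to a population with a different prevalence.

    Assumption (not from the original post): P(X | Y) is identical in both
    populations, so only the prior P(Y = 1) differs.
    """
    for name, value in [
        ("p_model", p_model),
        ("train_prevalence", train_prevalence),
        ("target_prevalence", target_prevalence),
    ]:
        if not 0.0 < value < 1.0:
            raise ValueError(f"{name} must be strictly between 0 and 1")

    # Convert everything to odds, where Bayes' theorem is multiplicative.
    model_odds = p_model / (1.0 - p_model)
    train_odds = train_prevalence / (1.0 - train_prevalence)
    target_odds = target_prevalence / (1.0 - target_prevalence)

    # Swap the training prior for the target prior.
    adjusted_odds = model_odds * target_odds / train_odds

    # Convert back to a probability.
    return adjusted_odds / (1.0 + adjusted_odds)


# A score of 0.5 from a model trained at 30% incidence, re-scaled to a
# hypothetical population with 1% incidence.
adjusted = adjust_for_prevalence(0.5, train_prevalence=0.30, target_prevalence=0.01)
print(round(adjusted, 3))  # prints 0.023
```

Under these assumptions, a patient sitting right at the 0.5 threshold in the hospital data would have only about a 2% probability of cancer in the low-incidence population. Equivalently, keeping a 0.5 cutoff on the adjusted probability amounts to raising the cutoff on the original scores to about 0.98.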

Dave
  • So you mean the incidence actually has an effect on the model. Would "Incidence is an attempt to quantify a background probability—an unconditional probability of disease for an entire population" be a good defense? – user10386405 Apr 15 '22 at 11:52