I am an MBA Student taking courses in statistics.

We had this discussion today in our class. Suppose we have a large dataset that contains the "Age Group" and "Gender" of people within a country and whether or not they have some specific disease.

Using this data, by filtering on different subsets, we calculate that:

  • Suppose the probability of having a specific disease in the entire population is p1
  • Suppose the probability of having a specific disease for the population of men is p2
  • Suppose the probability of having a specific disease for the population of women is p3
  • Suppose the probability of having a specific disease for the population of young people is p4
  • Suppose the probability of having a specific disease for the population of old people is p5
  • Suppose the probability of having a specific disease for the population of old men is p6
  • Suppose the probability of having a specific disease for the population of young men is p7
  • Suppose the probability of having a specific disease for the population of old women is p8
  • Suppose the probability of having a specific disease for the population of young women is p9

Now, imagine we realize that we forgot to collect the data for this one young man. What is the probability that he has this disease?

The obvious answer to this question seems like the probability is p7. But after thinking about this for a while, I had the following ideas:

  • What if the probability of having this disease is significantly more influenced by some variable that is not measured in this dataset (e.g. smoking)?

  • What if some of these subsets have a very small population? (e.g. suppose only 1% of all data is old men)

  • Suppose young men smoke a lot but this forgotten young man does not smoke - the probability of him having a disease might be closer to that of some other subgroup?

For such cases, I thought that it might be better to try and "average out" this information to safeguard against possible risks. For example, perhaps the probability that this young man has the disease is (p1 + p2 + p4 + p7)/4?

This way, we have taken into consideration hidden trends and patterns within the entire population, the entire population of men, the entire population of young people and the entire population of young men. The effects and skewness from possible outliers should now have been smoothed out?

From my Algebra classes, I know that there is no guarantee that p7 must equal (p1 + p2 + p4 + p7)/4. This brings me to my question:

In the absence of any other information and prior knowledge on this disease, is it possible that (p1 + p2 + p4 + p7)/4 might be a "less risky" estimate compared to p7 alone?

Is this a valid estimation procedure, or have I misunderstood how averages and pooled averages are intended to be interpreted and used?

Note 1: Of course the drawback to this pooled approach is that if there are no outliers in the data, then p7 would have been a more accurate estimate and we put ourselves in a worse position by pooling the averages.

Note 2: If we did in fact have some prior knowledge, it could have been interesting to take a "weighted pooled average" of the probabilities. But in the absence of prior knowledge and additional information, it looks like we have no choice but to assume equal weighting.

stats_noob

1 Answer

You can use the p7 probability as an estimate (or, if this estimate is very inaccurate, you can simply conclude that you do not know enough for an accurate answer and that more information should be gathered).

If you somehow want to improve this estimate then you need to add information, either through more measurements or through information based on theory. The latter can be tricky and can create bias.

Example

  • measurement 1: 100 young men among which 20 got sick

    estimate $p_7 = 0.2 \,(s.e. 0.04)$

  • measurement 2: 10 000 old men among which 5 000 got sick.

    estimate $p_6 = 0.5 \,(s.e. 0.005)$

The estimate $p_6$ is much more accurate, but can you honestly use it to predict the probability for a young man to get sick?

By using the $p_6$ figure you added more information, but it might be inaccurate information. You should only use information that makes sense. That is, when you know/assume that it will have a small bias (the acceptable level of 'small' depending on how accurate you desire to be).
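As a rough check of the standard errors quoted in the example, here is a minimal sketch that assumes each group is a simple random sample and uses the usual binomial formula $\sqrt{\hat p (1 - \hat p)/n}$. The function name `proportion_se` is only illustrative.

```python
import math

def proportion_se(successes, n):
    """Return the sample proportion and its binomial standard error."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, se

# Numbers taken from the example above
p7, se7 = proportion_se(20, 100)        # young men
p6, se6 = proportion_se(5000, 10000)    # old men

print(f"p7 = {p7:.2f} (s.e. {se7:.3f})")  # p7 = 0.20 (s.e. 0.040)
print(f"p6 = {p6:.2f} (s.e. {se6:.3f})")  # p6 = 0.50 (s.e. 0.005)
```

The small standard error of $p_6$ only says that the rate for old men is measured precisely; it says nothing about how close that rate is to the rate for young men.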


A related concept is the bias-variance trade-off.

A typical case in statistics where bias is added is regularised regression. The 'correctness' of the added bias is determined by training and validating a model, during which the amount of bias is optimized based on the performance of the model.
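To make the trade-off concrete for the example numbers, the sketch below compares the mean squared error (squared bias plus variance) of using the young-men proportion alone with that of an equal-weight average of two group proportions. It assumes independent binomial samples and treats the stated rates as the true ones; the second scenario, with a true rate of 0.22 for the other group, is a hypothetical added for contrast and is not from the example.

```python
def mse_direct(p_target, n_target):
    """MSE of the sample proportion of the target group (unbiased, so MSE = variance)."""
    return p_target * (1 - p_target) / n_target

def mse_of_average(p_target, n_target, p_other, n_other):
    """MSE of the equal-weight average of two independent sample proportions,
    judged as an estimate of p_target."""
    bias = (p_target + p_other) / 2 - p_target
    variance = (p_target * (1 - p_target) / n_target
                + p_other * (1 - p_other) / n_other) / 4
    return bias ** 2 + variance

direct = mse_direct(0.2, 100)
# Other group differs a lot (0.5 vs 0.2): the bias dominates
avg_far = mse_of_average(0.2, 100, 0.5, 10_000)
# Hypothetical: other group is nearly the same (0.22 vs 0.2): the variance reduction wins
avg_near = mse_of_average(0.2, 100, 0.22, 10_000)

print(direct, avg_far, avg_near)
```

Under these assumptions, averaging only helps when the other group's true rate is close to the target's, which is exactly the thing the data alone cannot tell you.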


Another related concept is Bayesian statistics. It provides a way to update knowledge after acquiring more data. Of course, if there is not a lot of data, then the final quality of the estimate depends a lot on the prior knowledge. You could use the accurate number $p_6$ as prior knowledge, but you would have to reduce the weight that you give to it. How you do this is relatively subjective but in a workflow with increasingly more knowledge and information the subjectivity reduces.
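One way to picture "reducing the weight" of the prior is a Beta-Binomial update, sketched below. This is an illustration under assumptions rather than a recommended procedure: the prior is a Beta distribution centred on 0.5 (the $p_6$ value), and `prior_weight` is a hypothetical knob for how many pseudo-observations the prior is worth.

```python
def beta_posterior_mean(prior_mean, prior_weight, successes, n):
    """Posterior mean of a proportion under a Beta(a, b) prior with
    a = prior_mean * prior_weight and b = (1 - prior_mean) * prior_weight."""
    a = prior_mean * prior_weight
    b = (1 - prior_mean) * prior_weight
    return (a + successes) / (a + b + n)

# Data for young men from the example: 20 sick out of 100
weak = beta_posterior_mean(0.5, 10, 20, 100)        # prior worth 10 observations
strong = beta_posterior_mean(0.5, 10_000, 20, 100)  # prior at its full sample size

print(round(weak, 4))    # 0.2273
print(round(strong, 4))  # 0.497
```

With a weak prior the estimate stays near the observed 0.2, while taking the old-men data at full weight drags it to almost 0.5. Choosing `prior_weight` is the subjective step mentioned above.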

  • @ Sextus: Thank you! I was just wondering - given no other information about the data quality or information about the subject matter .... is the approach I described in my question "(p1 + p2 + p4 + p7)/4" correct? Or would this approach add "noise" compared to just using p7? – stats_noob Oct 15 '22 at 16:09
  • I am gonna answer with a Socratic question. How about using "(p1 + p2 + p4 + p7 + p10)/5" where p10 is the probability of winning the lottery. Why would adding p10 in the equation be better or worse? – Sextus Empiricus Oct 15 '22 at 17:28
  • @ Sextus: Great point! My answer to your question would be - the probability of winning the lottery clearly has nothing to do with having the disease. Therefore, it should have nothing to do with this? – stats_noob Oct 15 '22 at 18:15
  • I know we are not supposed to post links to other questions in the comments, but I attempted to write an R simulation corresponding to a similar problem! https://stats.stackexchange.com/questions/592407/combining-averages-to-improve-estimates – stats_noob Oct 15 '22 at 18:16
  • (I will delete the above comment if it's not allowed) – stats_noob Oct 15 '22 at 18:16
  • So you believe that it makes sense that we do not add p10 to the computation. But why would the use of p1, p2 and p4 make sense? What is the barrier that lets us add those but not p10? – Sextus Empiricus Oct 15 '22 at 18:21
  • Because in this case, p1, p2 and p4 in theory are more related to the probability of having the disease compared to the probability of winning the lottery. – stats_noob Oct 15 '22 at 18:23
  • "more related to" Ok, but how do you quantify that, what makes exactly the difference? What about using p11, the probability for a young man to get another disease, would that make more or less sense than p1, p2 and p4? – Sextus Empiricus Oct 15 '22 at 18:26
  • At this point, my logic would tell me that we would need to speak to subject matter experts who can provide us advice as to which variables might influence the probability of having this specific disease. If the experts were to tell us that "p11" might be relevant, I would say that it might be useful and we could consider its effect in the average estimate. However, I am not knowledgeable about statistics and my logic could be wrong. – stats_noob Oct 15 '22 at 18:28
  • So can we conclude that, without subject matter experts, using (p1 + p2 + p4 + p7)/4 makes no sense? – Sextus Empiricus Oct 15 '22 at 18:31
  • Well in theory yes ... if the subject matter experts tell us that age and gender have no influence on the disease ....then none of these probabilities are relevant and should not be averaged. – stats_noob Oct 15 '22 at 18:33
  • The other way around. When age and gender have influence, then averaging should not be done. – Sextus Empiricus Oct 15 '22 at 18:35
  • "When age and gender have influence, then averaging should not be done." - is this because of confounding of variable effects? – stats_noob Oct 15 '22 at 18:37
  • If age or gender have an influence then the $p_i$ will be different and you should use $p_7$ and not average with the others. Only when the noise of $p_7$ is very large and the bias of the other $p_i$ is expected to be small (to make that assessment you would need subject matter information) could you use the other $p_i$ instead of $p_7$. – Sextus Empiricus Oct 15 '22 at 19:36
  • In the example age has an influence, should you average p6 and p7? – Sextus Empiricus Oct 15 '22 at 19:42
  • The more I talk, the more I feel that my lack of knowledge in statistics is digging me deeper into a hole .... I would say that no, p6 and p7 should not be averaged. We are trying to predict the probability of this new YOUNG man having a disease. P6 is for OLD men, therefore p6 should not be averaged. – stats_noob Oct 15 '22 at 19:49
  • I would like to greatly thank you for taking time out of your Saturday to educate me on statistics. I have been reading the whole day on the internet to try and see if my understanding of this is correct https://stats.stackexchange.com/questions/592352/is-there-such-a-thing-as-a-longitudinal-variable - could you please take a look at it if you have time (I will then delete this comment) – stats_noob Oct 15 '22 at 19:50