I am an MBA student taking courses in statistics.
We had this discussion today in our class. Suppose we have a large dataset that records the "Age Group" and "Gender" of people within a country, along with whether or not each person has a specific disease ("Disease or Not").
Using this data, by filtering on different subsets, we calculate that:
- Suppose the probability of having a specific disease in the entire population is p1
- Suppose the probability of having a specific disease for the population of men is p2
- Suppose the probability of having a specific disease for the population of women is p3
- Suppose the probability of having a specific disease for the population of young people is p4
- Suppose the probability of having a specific disease for the population of old people is p5
- Suppose the probability of having a specific disease for the population of old men is p6
- Suppose the probability of having a specific disease for the population of young men is p7
- Suppose the probability of having a specific disease for the population of old women is p8
- Suppose the probability of having a specific disease for the population of young women is p9
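To make the setup concrete, here is a minimal sketch of how I imagine these probabilities being computed by filtering. The ten records, the tuple layout, and the `disease_rate` helper are all made up for illustration; they are not from our class dataset.

```python
from fractions import Fraction

# Hypothetical toy records: (age_group, gender, has_disease). Not real data.
RECORDS = [
    ("young", "man", True),
    ("young", "man", False),
    ("young", "man", False),
    ("young", "man", False),
    ("young", "woman", False),
    ("young", "woman", False),
    ("old", "man", True),
    ("old", "man", True),
    ("old", "woman", True),
    ("old", "woman", False),
]


def disease_rate(records, age=None, gender=None):
    """Share of matching records with the disease; None means 'do not filter'."""
    subset = [
        r for r in records
        if (age is None or r[0] == age) and (gender is None or r[1] == gender)
    ]
    if not subset:
        raise ValueError("no records match this filter")
    # True counts as 1 and False as 0 when summed.
    return Fraction(sum(r[2] for r in subset), len(subset))


p1 = disease_rate(RECORDS)                              # entire population
p2 = disease_rate(RECORDS, gender="man")                # men
p4 = disease_rate(RECORDS, age="young")                 # young people
p7 = disease_rate(RECORDS, age="young", gender="man")   # young men
print(p1, p2, p4, p7)  # 2/5 1/2 1/6 1/4
```

The other probabilities (p3, p5, p6, p8, p9) would come from the same helper with different filter arguments.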
Now, imagine we realize that we forgot to collect the data for this one young man. What is the probability that he has this disease?
The obvious answer seems to be p7. But after thinking about this for a while, I had the following ideas:
What if the probability of having this disease is significantly more influenced by some variable that is not measured in this dataset (e.g. smoking)?
What if some of these subsets have a very small population? (e.g. suppose only 1% of all data is old men)
Suppose young men smoke a lot but this forgotten young man does not smoke. Might his probability of having the disease be closer to that of some other subgroup?
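For the small-subgroup concern in particular, one way I can quantify it is the usual standard error of a proportion, sqrt(p(1 - p)/n). The sketch below assumes the records are independent draws, and the subgroup sizes and the value 0.2 are hypothetical; `proportion_standard_error` is just a name I chose.

```python
import math


def proportion_standard_error(p, n):
    """Standard error of an observed proportion p based on n independent records."""
    if n <= 0:
        raise ValueError("n must be positive")
    return math.sqrt(p * (1 - p) / n)


# Hypothetical sizes: a large subgroup versus one that is 1% as large.
se_large = proportion_standard_error(0.2, 10_000)
se_small = proportion_standard_error(0.2, 100)
print(f"{se_large:.3f} {se_small:.3f}")  # 0.004 0.040
```

With 100 times fewer records, the estimate is about 10 times noisier, which is the kind of instability I am worried about for rare subgroups.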
For such cases, I thought that it might be better to try and "average out" this information to safeguard against possible risks. For example, perhaps the probability that this young man has the disease is (p1 + p2 + p4 + p7)/4?
This way, we have taken into consideration hidden trends and patterns within the entire population, the population of men, the population of young people and the population of young men. The effects and skewness from possible outliers should now have been smoothed out?
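As a quick numerical sketch of this proposal, here are made-up values for the four probabilities (they are not from any real dataset), showing how far the equal-weight average can sit from p7:

```python
# Hypothetical probabilities, chosen only for illustration.
p1, p2, p4, p7 = 0.10, 0.12, 0.06, 0.04  # all, men, young, young men

pooled = (p1 + p2 + p4 + p7) / 4
gap = pooled - p7
print(f"pooled = {pooled:.3f}, p7 = {p7:.3f}, gap = {gap:.3f}")
# pooled = 0.080, p7 = 0.040, gap = 0.040
```

The two values coincide only when p7 = (p1 + p2 + p4)/3, which follows from rearranging the formula.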
From my algebra classes, I know that there is no guarantee that p7 must equal (p1 + p2 + p4 + p7)/4. This brings me to my question:
In the absence of any other information and prior knowledge on this disease, is it possible that (p1 + p2 + p4 + p7)/4 might be a "less risky" estimate compared to p7 alone?
Is this a valid estimation procedure, or have I misunderstood how averages and pooled averages are intended to be interpreted and used?
Note 1: Of course, the drawback to this pooled approach is that if there are no outliers in the data, then p7 would have been a more accurate estimate and we put ourselves in a worse position by pooling the averages.
Note 2: If we did in fact have some prior knowledge, it could have been interesting to take a "weighted pooled average" of the probabilities. But in the absence of prior knowledge and additional information, it looks like we have no choice but to assume equal weighting.
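A minimal sketch of what I mean by a "weighted pooled average" is below. The `weighted_pool` helper, the probability values, and the weights `[1, 1, 1, 5]` are all hypothetical; I do not know what the right weights would be, and equal weights are used when none are supplied.

```python
def weighted_pool(probabilities, weights=None):
    """Weighted average of probabilities; equal weights when none are given."""
    if weights is None:
        weights = [1.0] * len(probabilities)
    if len(weights) != len(probabilities):
        raise ValueError("need one weight per probability")
    total = sum(weights)
    if total <= 0 or any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative with a positive sum")
    return sum(w * p for w, p in zip(weights, probabilities)) / total


# Hypothetical values for p1, p2, p4, p7 and hypothetical prior weights.
probs = [0.10, 0.12, 0.06, 0.04]
equal = weighted_pool(probs)                        # same as (p1 + p2 + p4 + p7)/4
favour_p7 = weighted_pool(probs, weights=[1, 1, 1, 5])
print(round(equal, 4), round(favour_p7, 4))  # 0.08 0.06
```

Putting more weight on p7 pulls the pooled value back toward the young-men estimate, and putting all of the weight on p7 recovers p7 itself.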