Combining Averages to Improve Estimates?

Question

I am an MBA student taking courses in statistics.

Recently, I posted a question (Can "Pooled Averages" be considered as "Better" compared to "Individual Averages"?) in which I asked if averaging different averages together can improve the quality of an estimate.

Now, I have a slightly modified version of this question.

Suppose we are only given information on whether different people in the population have a specific disease (i.e. we are not provided with a dataset, only provided with summary information):

Suppose the probability of having a specific disease in the entire population is p1
Suppose the probability of having a specific disease for the population of men is p2
Suppose the probability of having a specific disease for the population of women is p3
Suppose the probability of having a specific disease for the population of young people is p4
Suppose the probability of having a specific disease for the population of old people is p5

Now, imagine we realize that we forgot to collect the data for this one young man. What is the probability that he has this disease?

I see several different methods of estimating this probability:

Method 1 (Overall Effect): p1
Method 2 (Gender Effect): p2
Method 3 (Age Effect): p4
Method 4 (Average Age and Gender Effect): (p2 + p4)/2
Method 5 (Average Overall, Age and Gender Effect): (p1 + p2 + p4)/3
Method 6 (Average Overall and Age Effect): (p1 + p4)/2
Method 7 (Average Overall and Gender Effect): (p1 + p2)/2
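
As a minimal sketch, the seven candidate estimates can be computed from the summary probabilities alone. The numeric values of p1, p2 and p4 below are made up purely for illustration and do not come from any real data.

```r
# Hypothetical summary probabilities (illustrative values only)
p1 <- 0.30  # overall population
p2 <- 0.35  # men
p4 <- 0.20  # young people

# The seven candidate estimates for a young man
estimates <- c(
  Method_1 = p1,
  Method_2 = p2,
  Method_3 = p4,
  Method_4 = (p2 + p4) / 2,
  Method_5 = (p1 + p2 + p4) / 3,
  Method_6 = (p1 + p4) / 2,
  Method_7 = (p1 + p2) / 2
)
print(round(estimates, 3))
```

Methods 4 to 7 always fall between the smallest and largest of the values they average, so none of them can be more extreme than Methods 1 to 3.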

I have the following question: Using statistics, do we have any analysis that can compare the validity of these methods? For example, perhaps some of these methods are meaningless as they double count and introduce noise into the estimates? Or perhaps some of these methods will result in very large standard errors?

I have been having this discussion with my classmates, and different people think it's either Method 1, Method 4 or Method 5. But is there any way to formally establish which of these methods is best?

EXTRA: I spent some time thinking about a simulation experiment in R to see which method is better. Imagine you have a dataset that contains information on the disease, age and gender of patients:

### DATA FOR PROBLEM
# Disease = 1, No Disease = 0
Disease <- c(1,0)
Disease <- sample(Disease, 10000, replace=TRUE, prob=c(0.3, 0.7))
# Male = 1, Female = 0
Gender <- c(1,0)
Gender <- sample(Gender, 10000, replace=TRUE, prob=c(0.5, 0.5))
# Old = 1, Young = 0
Age <- c(1,0)
Age <- sample(Age, 10000, replace=TRUE, prob=c(0.5, 0.5))
# One ID per simulated patient (10000 rows)
Patient_ID = 1:10000
Simulation_Data = data.frame(Patient_ID, Disease, Gender, Age)
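
For reference, the summary quantities p1 to p5 from the setup above correspond to group means of the Disease column. The sketch below regenerates the simulated data so that it runs on its own; the seed and the helper name `group_rate` are my own additions for illustration.

```r
set.seed(123)  # arbitrary seed, only for reproducibility
n <- 10000
Disease <- sample(c(1, 0), n, replace = TRUE, prob = c(0.3, 0.7))
Gender  <- sample(c(1, 0), n, replace = TRUE, prob = c(0.5, 0.5))  # Male = 1
Age     <- sample(c(1, 0), n, replace = TRUE, prob = c(0.5, 0.5))  # Old = 1

# Hypothetical helper: disease rate among rows where `group` equals `value`
group_rate <- function(disease, group, value) mean(disease[group == value])

p1 <- mean(Disease)                   # overall
p2 <- group_rate(Disease, Gender, 1)  # men
p3 <- group_rate(Disease, Gender, 0)  # women
p4 <- group_rate(Disease, Age, 0)     # young
p5 <- group_rate(Disease, Age, 1)     # old
print(c(p1 = p1, p2 = p2, p3 = p3, p4 = p4, p5 = p5))
```

As far as I can tell, because Disease is generated independently of Gender and Age here, all five values should be close to 0.3, so in this particular simulation the methods can differ only by sampling noise.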

Now, imagine that a new patient enters with a randomly assigned gender, age and disease status. We can now compare which of these methods will provide a better estimate by running this simulation many times:

#### LOOP
results = list()
for (i in 1:1000)
{
# SIMULATE DATA FOR A NEW PATIENT
Patient_Being_Tested_Disease_i = ifelse( runif(1, 0, 1) > 0.5, 1,0)
Patient_Being_Tested_Gender_i = ifelse( runif(1, 0, 1) > 0.5, 1,0)
Patient_Being_Tested_Age_i = ifelse( runif(1, 0, 1) > 0.5, 1,0)
Overall_Prob = mean(Simulation_Data$Disease)
Patient_Gender_Data_i = Simulation_Data[which( Simulation_Data$Gender == Patient_Being_Tested_Gender_i), ]
Patient_Gender_Prob_i = mean(Patient_Gender_Data_i$Disease)
Patient_Age_Data_i = Simulation_Data[which( Simulation_Data$Age == Patient_Being_Tested_Age_i), ]
Patient_Age_Prob_i = mean(Patient_Age_Data_i$Disease)
Method_1_i = Overall_Prob
Method_2_i =  mean(Patient_Gender_Data_i$Disease)
Method_3_i = mean(Patient_Age_Data_i$Disease)
Method_4_i = (Patient_Gender_Prob_i + Patient_Age_Prob_i)/2
Method_5_i = (Overall_Prob + Patient_Gender_Prob_i + Patient_Age_Prob_i)/3
Method_6_i = (Overall_Prob + Method_3_i)/2
Method_7_i = (Overall_Prob + Method_2_i)/2
methods_i <- c(Method_1 = Method_1_i   , Method_2 = Method_2_i, Method_3 = Method_3_i,
             Method_4 = Method_4_i ,
             Method_5 = Method_5_i, Method_6 =  Method_6_i , Method_7 =  Method_7_i)
winner_i = ifelse(Patient_Being_Tested_Disease_i == 0,  names(methods_i)[which.min(methods_i)], names(methods_i)[which.max(methods_i)])
print(winner_i)
results_i = data.frame(i, winner_i)
results[[i]] <- results_i
}
final <- do.call(rbind.data.frame, results)
counts <- table(final$winner_i)
barplot(counts, main="Which Method Won?",
   xlab="Method", ylab = "Number of Wins")

My logic is this: when the patient has the disease, you want to select the method with the largest probability, and when the patient does not have the disease, you want to select the method with the smallest probability.
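
The selection rule just described can be written as a small function. This is only a sketch of that rule; the function name `pick_winner` and the example values are made up for illustration.

```r
# Hypothetical helper: returns the name of the "winning" method for one patient.
# methods: named numeric vector of estimated probabilities
# disease: 1 if the patient has the disease, 0 otherwise
pick_winner <- function(methods, disease) {
  if (disease == 1) {
    names(methods)[which.max(methods)]  # largest estimate wins
  } else {
    names(methods)[which.min(methods)]  # smallest estimate wins
  }
}

# Illustrative values only
example_methods <- c(Method_1 = 0.30, Method_2 = 0.35, Method_3 = 0.20)
print(pick_winner(example_methods, disease = 1))  # "Method_2"
print(pick_winner(example_methods, disease = 0))  # "Method_3"
```

Note that `which.max` and `which.min` return the first position in case of ties, so tied methods are always resolved in favour of the one listed first.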

I hope I did this correctly!

It is a bit unclear what you mean by the combination of these multiple effects. You might be talking about a linear model, but given your recent post it seems like you are averaging the effects. In relation to a linear model, it might be interesting to read about the piranha problem; you cannot generally add up multiple effects. — Sextus Empiricus, Oct 15 '22 at 18:18
Would this approach that I have done be similar in theory to the spirit of a linear model? — stats_noob, Oct 15 '22 at 18:23
No, averaging would be different. In a linear model you would be adding multiple effects, so two combined effects would end up bigger (adding up) instead of being averaged. — Sextus Empiricus, Oct 15 '22 at 18:29
