I am an MBA student taking courses in statistics.
Recently, I posted a question (Can "Pooled Averages" be considered as "Better" compared to "Individual Averages"?) in which I asked if averaging different averages together can improve the quality of an estimate.
Now, I have a slightly modified version of this question.
Suppose we are only given information on whether different people in the population have a specific disease (i.e. we are not provided with a dataset, only provided with summary information):
- Suppose the probability of having a specific disease in the entire population is p1
- Suppose the probability of having a specific disease for the population of men is p2
- Suppose the probability of having a specific disease for the population of women is p3
- Suppose the probability of having a specific disease for the population of young people is p4
- Suppose the probability of having a specific disease for the population of old people is p5
Now, imagine we realize that we forgot to collect data for one young man. What is the probability that he has this disease?
I see several different methods of estimating this probability:
- Method 1 (Overall Effect): p1
- Method 2 (Gender Effect): p2
- Method 3 (Age Effect): p4
- Method 4 (Average Age and Gender Effect): (p2 + p4)/2
- Method 5 (Average Overall, Age and Gender Effect): (p1 + p2 + p4)/3
- Method 6 (Average Overall and Age Effect): (p1 + p4)/2
- Method 7 (Average Overall and Gender Effect): (p1 + p2)/2
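To make the seven candidates concrete, here is a minimal R sketch that evaluates each formula for a young man. The numeric values of `p1`, `p2` and `p4` are made-up placeholders (not real data), and `estimates` is just an illustrative name.

```r
# Hypothetical summary probabilities (placeholders, not real data)
p1 <- 0.30  # entire population
p2 <- 0.35  # men
p4 <- 0.20  # young people

# The seven candidate estimates for a young man
estimates <- c(
  Method_1 = p1,
  Method_2 = p2,
  Method_3 = p4,
  Method_4 = (p2 + p4) / 2,
  Method_5 = (p1 + p2 + p4) / 3,
  Method_6 = (p1 + p4) / 2,
  Method_7 = (p1 + p2) / 2
)

print(round(estimates, 3))
```

Each averaged method always lies between the smallest and largest of the values it averages, so the methods can only disagree when p1, p2 and p4 differ from one another.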
I have the following question: Is there a statistical analysis that can compare the validity of these methods? For example, perhaps some of these methods are meaningless because they double-count information and introduce noise into the estimates? Or perhaps some of these methods will result in very large standard errors?
I have been having this discussion with my classmates, and different people think it's either Method 1, Method 4 or Method 5. But is there any way to formally establish which of these methods is the best?
EXTRA: I spent some time thinking about a simulation experiment in R to see which method is better. Imagine you have a dataset that contains information on the disease, age and gender of patients:
### DATA FOR PROBLEM
# Disease = 1, No Disease = 0
Disease <- c(1,0)
Disease <- sample(Disease, 10000, replace=TRUE, prob=c(0.3, 0.7))
# Male = 1, Female = 0
Gender <- c(1,0)
Gender <- sample(Gender, 10000, replace=TRUE, prob=c(0.5, 0.5))
# Old = 1, Young = 0
Age <- c(1,0)
Age <- sample(Age, 10000, replace=TRUE, prob=c(0.5, 0.5))
# One ID per row (10000 rows); 1:1000 would be silently recycled by data.frame
Patient_ID = 1:10000
Simulation_Data = data.frame(Patient_ID, Disease, Gender, Age)
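Before using the data, it may help to look at the quantities the methods are built from. The sketch below regenerates data with the same sampling scheme as above and prints the disease rate overall, by gender and by age. The seed and the names `overall_rate`, `rate_by_gender` and `rate_by_age` are my own assumptions for illustration. Because Disease, Gender and Age are sampled independently here, all of these rates should come out close to 0.3.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
n <- 10000

# Same sampling scheme as above: each variable is drawn independently
Disease <- sample(c(1, 0), n, replace = TRUE, prob = c(0.3, 0.7))
Gender  <- sample(c(1, 0), n, replace = TRUE, prob = c(0.5, 0.5))
Age     <- sample(c(1, 0), n, replace = TRUE, prob = c(0.5, 0.5))

# Disease rate overall and within each subgroup (0/1 levels)
overall_rate   <- mean(Disease)
rate_by_gender <- tapply(Disease, Gender, mean)
rate_by_age    <- tapply(Disease, Age, mean)

print(overall_rate)
print(rate_by_gender)
print(rate_by_age)
```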
Now, imagine that a new patient enters with a randomly assigned gender, age and disease status. We can now compare which of these methods will provide a better estimate by running this simulation many times:
#### LOOP
results = list()
for (i in 1:1000)
{
# SIMULATE DATA FOR A NEW PATIENT
Patient_Being_Tested_Disease_i = ifelse( runif(1, 0, 1) > 0.5, 1,0)
Patient_Being_Tested_Gender_i = ifelse( runif(1, 0, 1) > 0.5, 1,0)
Patient_Being_Tested_Age_i = ifelse( runif(1, 0, 1) > 0.5, 1,0)
Overall_Prob = mean(Simulation_Data$Disease)
Patient_Gender_Data_i = Simulation_Data[which( Simulation_Data$Gender == Patient_Being_Tested_Gender_i), ]
Patient_Gender_Prob_i = mean(Patient_Gender_Data_i$Disease)
Patient_Age_Data_i = Simulation_Data[which( Simulation_Data$Age == Patient_Being_Tested_Age_i), ]
Patient_Age_Prob_i = mean(Patient_Age_Data_i$Disease)
Method_1_i = Overall_Prob
Method_2_i = mean(Patient_Gender_Data_i$Disease)
Method_3_i = mean(Patient_Age_Data_i$Disease)
Method_4_i = (Patient_Gender_Prob_i + Patient_Age_Prob_i)/2
Method_5_i = (Overall_Prob + Patient_Gender_Prob_i + Patient_Age_Prob_i)/3
Method_6_i = (Overall_Prob + Method_3_i)/2
Method_7_i = (Overall_Prob + Method_2_i)/2
methods_i <- c(Method_1 = Method_1_i , Method_2 = Method_2_i, Method_3 = Method_3_i,
Method_4 = Method_4_i ,
Method_5 = Method_5_i, Method_6 = Method_6_i , Method_7 = Method_7_i)
winner_i = ifelse(Patient_Being_Tested_Disease_i == 0, names(methods_i)[which.min(methods_i)], names(methods_i)[which.max(methods_i)])
print(winner_i)
results_i = data.frame(i, winner_i)
results[[i]] <- results_i
}
final <- do.call(rbind.data.frame, results)
counts <- table(final$winner_i)
barplot(counts, main="Which Method Won?",
xlab="Method", ylab = "Number of Wins")
My logic is this: when the patient has the disease, the winning method is the one with the largest estimated probability, and when the patient does not have the disease, the winning method is the one with the smallest estimated probability.
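As a small illustration of this rule in isolation, here is a sketch with made-up estimates for three of the methods. The helper `pick_winner` and the numbers in `methods` are hypothetical and are not part of the simulation above.

```r
# Hypothetical method estimates for one simulated patient
methods <- c(Method_1 = 0.30, Method_2 = 0.31, Method_3 = 0.29)

# Illustrative helper: the estimate closest to the true 0/1 outcome wins.
# On ties, which.max / which.min return the first matching method.
pick_winner <- function(methods, disease) {
  if (disease == 1) {
    names(methods)[which.max(methods)]
  } else {
    names(methods)[which.min(methods)]
  }
}

print(pick_winner(methods, disease = 1))  # "Method_2"
print(pick_winner(methods, disease = 0))  # "Method_3"
```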
I hope I did this correctly!