
I am taking a course in statistics, and our prof showed us a real survey that was recently in the news. In a country with a population of roughly 30-40 million people, only 1,000 people were asked their opinion on a controversial political question (e.g. should more money be invested in healthcare?). A majority answered "yes", and the headline of the news article was "majority of citizens support increasing investments in healthcare".

Some information about the methodology was provided: allegedly, the survey used random sampling proportional to key demographics identified in the census. As a result, the opinion of only about 0.003% of the country's population (1,000 out of roughly 35 million) was considered sufficient to represent the entire country's opinion on this matter.
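For concreteness, here is a minimal sketch of what "random sampling proportional to key demographics" might look like in R. The `region` variable, its census shares, and the population frame are all invented for illustration; the real survey presumably stratified on several demographics at once:

set.seed(1)
# Hypothetical census shares for a single demographic variable
census_shares = c(urban = 0.60, suburban = 0.25, rural = 0.15)
population = data.frame(
  id = 1:1000000,
  region = sample(names(census_shares), 1000000, replace = TRUE, prob = census_shares)
)
n_total = 1000
# Allocate the 1,000 interviews to each stratum in proportion to its census share
n_per_stratum = round(n_total * census_shares)
survey = do.call(rbind, lapply(names(census_shares), function(r) {
  stratum = population[population$region == r, ]
  stratum[sample(nrow(stratum), n_per_stratum[r]), ]
}))
table(survey$region)  # sample composition mirrors the census shares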

I personally find this application of statistics quite remarkable! Suppose the "opinions" of the population follow a normal distribution; then an estimate from a small random sample gets very close to the actual average opinion (which, in a simulation, we know exactly). For example, suppose we ask people "by how much should healthcare spending be increased relative to current spending (in percent)?" (e.g. 14% more, 39% more, etc.), and suppose the responses are normally distributed with a given mean and variance (e.g. the average person thinks the country should invest 50% more in healthcare). The following R simulation repeatedly surveys 2,000 respondents out of a population of one million, i.e. 0.2% of the population (the real survey's 1,000 out of 35 million is an even smaller fraction):

set.seed(123)

my_data = data.frame(id = 1:1000000,
                     percent_increase_in_health_care = rnorm(1000000, 50, 10))

results = list()

for (i in 1:1000) {
  # draw 2,000 people (0.2% of the simulated population) and record the sample mean
  sample_i = my_data[sample(nrow(my_data), 2000), ]
  mean_i = mean(sample_i$percent_increase_in_health_care)
  results[[i]] = data.frame(i, mean_i)
}

final = do.call(rbind.data.frame, results)

mean(final$mean_i)
# [1] 50.00377

hist(final$mean_i,
     main = "Distribution of Sample Means, 0.2% of Population Sampled (1,000 Random Samples)")

[Histogram of the 1,000 sample means, tightly centered around 50]

As we can see, even when such a tiny percentage of the population is asked for its opinion, the average opinion from the sample is very close to the actual average opinion of the population!
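The standard error of the mean makes this quantitative: the spread of the sample mean is $\sigma/\sqrt{n}$, which depends on the sample size $n$ but not on the population size. A quick check against the simulation above (reusing the `final` object from that run):

# Theoretical standard error of the mean: sigma / sqrt(n) = 10 / sqrt(2000)
10 / sqrt(2000)
# [1] 0.2236068
# Empirical spread of the 1,000 simulated sample means -- should be close to 0.224
sd(final$mean_i)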

I was curious to see what happens if the country's opinions are not normally distributed and a similar percentage of the country is asked for its opinion: does the same accuracy still hold? And does the accuracy of the estimate get worse the "further" the distribution is from normal (e.g. gamma, Weibull, etc.)?
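As a first check on a skewed distribution, here is a minimal sketch of the same experiment with a gamma population; the shape and rate parameters are my own choices, picked so the population mean is 50 and the standard deviation is 10, matching the normal case:

set.seed(123)
# Gamma population: mean = shape/rate = 50, sd = sqrt(shape)/rate = 10
my_data_g = data.frame(id = 1:1000000,
                       percent_increase_in_health_care = rgamma(1000000, shape = 25, rate = 0.5))
results_g = list()
for (i in 1:1000) {
  sample_i = my_data_g[sample(nrow(my_data_g), 2000), ]
  results_g[[i]] = data.frame(i, mean_i = mean(sample_i$percent_increase_in_health_care))
}
final_g = do.call(rbind.data.frame, results_g)
mean(final_g$mean_i)  # lands very close to the true population mean of 50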

As a more extreme example, imagine a country in which half the people think healthcare spending should be nearly doubled (around a 90% increase) and half the people think that it should remain roughly at what it is (around a 5% increase):

set.seed(123)

my_data_1 = data.frame(id = 1:500000,
                       percent_increase_in_health_care = rnorm(500000, 5, 1))
my_data_2 = data.frame(id = 500001:1000000,
                       percent_increase_in_health_care = rnorm(500000, 90, 5))

my_data = rbind(my_data_1, my_data_2)

results = list()

for (i in 1:1000) {
  # again draw 2,000 people (0.2%) and record the sample mean
  sample_i = my_data[sample(nrow(my_data), 2000), ]
  mean_i = mean(sample_i$percent_increase_in_health_care)
  results[[i]] = data.frame(i, mean_i)
}

final = do.call(rbind.data.frame, results)

mean(final$mean_i)
# [1] 47.56409
# (the true mixture mean is 0.5 * 5 + 0.5 * 90 = 47.5)

hist(final$mean_i,
     main = "Distribution of Sample Means, 0.2% of Population Sampled (1,000 Random Samples)")

[Histogram of the 1,000 sample means, centered near the mixture mean of 47.5]
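Notably, the same $\sigma/\sqrt{n}$ rule governs the spread of the sample means here too, with $\sigma$ now being the (much larger) standard deviation of the bimodal mixture itself:

# Standard error implied by the mixture's own spread: sigma / sqrt(n)
sd(my_data$percent_increase_in_health_care) / sqrt(2000)  # roughly 0.95
# Empirical spread of the 1,000 sample means -- should be of the same size
sd(final$mean_i)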

In the end, is this true in general? Regardless of the underlying distribution of the response variable (e.g. percent increase in healthcare spending), if a random sample is drawn, will the mean response, on average and over many repetitions, reflect the true mean response of the population?
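Writing out the expectation seems to confirm this without any distributional assumption: under simple random sampling each unit $i$ is included in the sample $S$ with probability $n/N$, so

$$E[\bar x] = \frac{1}{n} \sum_{i=1}^{N} \Pr(i \in S)\, y_i = \frac{1}{n} \sum_{i=1}^{N} \frac{n}{N}\, y_i = \frac{1}{N} \sum_{i=1}^{N} y_i = \mu.$$

Is this reasoning correct?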

stats_noob
  • You're basically discovering "design based inference". If we think of a finite population as having $N$ fixed but unknown values $y_1, \dots, y_N$ and we take a random sample $S$ of $n$ of these, we think of the randomness as coming from the sampling mechanism rather than a model for these values. So e.g. in our sample mean $\bar x = \frac 1n \sum_{i \in S} y_i$, the random thing is $S$, not the $y_i$. Then we think of the distribution of $\bar x$ over all possible samples. If we do a simple random sample $\bar x$ is unbiased and we can get confidence intervals and $\bar x \to \mu$ still! – jld Nov 15 '22 at 02:49
  • Thank you for this comment! I will look into "design based inference"! – stats_noob Nov 16 '22 at 06:33
  • Yeah for sure, hope that actually is relevant/helpful! – jld Nov 17 '22 at 14:55

0 Answers