
In his celebrated talk on randomness and pseudorandomness

https://youtu.be/Jz1UoAWD80Q?t=366

legendary mathematician Avi Wigderson makes the powerful statement that sampling is perhaps the most important use of randomness.

He skips over this rather elementary slide quickly, but I am trying to understand the numbers behind it using the weak law of large numbers or Chebyshev's inequality, and I am not sure I am getting it.

He suggests that 2000 i.i.d. samples would predict the result of an election in a population of 280 million to within +/-2 percent with 99% probability. He relates this to the weak law of large numbers, but I can't see a concrete connection.

1/Sqrt(2000) is about 2%, which brings up another thing that I never understand: why does the error in direct sampling go as 1/Sqrt(N)? My feeling is that the error in the standard deviation should go as 1/Sqrt(N) but the error in the mean should scale as 1/N.

Apologies in advance, if these are too simple questions!

anon248
    For some intuition (but not a full answer), see https://stats.stackexchange.com/questions/3734. – whuber May 16 '22 at 12:46

1 Answer


In choosing 2000 at random from 280,000,000 we can ignore the difference between sampling without and sampling with replacement. So, we are essentially interested in the binomial proportion $p$ of voters for Candidate A in an election contest between two candidates.

A Wald confidence interval for $p$ is based on the number $x$ in favor of A out of $n$ potential voters interviewed. We estimate $p$ by $\hat p = x/n.$ The standard deviation of $\hat p,$ which is $\sqrt{p(1-p)/n},$ is called the standard error of $\hat p.$ Note that $0 \le p \le 1.$ The standard error is largest when $p = 1/2.$ The standard error itself can be estimated by $\sqrt{\hat p(1-\hat p)/n}.$
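As a rough check on where the $n^{-1/2}$ comes from: the count $x$ is a sum of $n$ independent 0/1 votes, so its variance is $np(1-p),$ and dividing by $n$ to get $\hat p$ leaves variance $p(1-p)/n.$ The sketch below simulates many polls and compares the observed spread of $\hat p$ with that formula. The function names, the seed, and the choice of 1000 repetitions are illustrative assumptions, not part of the original argument.

```python
import math
import random
import statistics


def theoretical_se(n, p=0.5):
    """Standard error of p_hat from the formula sqrt(p(1-p)/n)."""
    return math.sqrt(p * (1 - p) / n)


def empirical_se(n, p=0.5, reps=1000, seed=0):
    """Standard deviation of p_hat = x/n across `reps` simulated polls."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    p_hats = [
        sum(rng.random() < p for _ in range(n)) / n  # one poll of n voters
        for _ in range(reps)
    ]
    return statistics.pstdev(p_hats)


# Quadrupling n should roughly halve the spread of p_hat.
for n in (100, 400, 1600):
    print(n, round(empirical_se(n), 4), round(theoretical_se(n), 4))
```

If the simulation behaves as expected, each fourfold increase in $n$ cuts both columns roughly in half, which is the $1/\sqrt{n}$ scaling of the error in the mean.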

[Figure]

Assuming that $Z = \frac{p - \hat p}{\sqrt{\hat p(1-\hat p)/n}}$ is approximately distributed as standard normal, one has $P(-1.96 < Z < 1.96) = 0.95.$ Accordingly, one can say that an approximate 95% confidence interval for $p$ is of the form $\hat p \pm 1.96\sqrt{\frac{\hat p(1-\hat p)}{n}}.$ This is the Wald confidence interval and it works reasonably well as long as $n > 500$ or so.

The corresponding 99% Wald CI is of the form $\hat p \pm 2.576\sqrt{\frac{\hat p(1-\hat p)}{n}}.$ If we interview about $n = 2000$ randomly selected potential voters, then

$$2.576\sqrt{\frac{\hat p(1-\hat p)}{n}}\le 2.576\sqrt{\frac{(1/2)(1/2)}{2000}} = 2.576/\sqrt{8000} \approx 0.029,$$

and the margin is smaller if $\hat p \ne 1/2;$ still smaller for a 95% CI.
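The arithmetic above can be reproduced with a small sketch using Python's standard library. The helper names `wald_margin` and `sample_size_for` are hypothetical, and the calculation assumes the worst case $\hat p = 1/2$ together with the normal approximation used for the Wald interval.

```python
import math
from statistics import NormalDist


def wald_margin(n, confidence=0.99, p_hat=0.5):
    """Half-width of the Wald interval: z * sqrt(p_hat(1-p_hat)/n)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    return z * math.sqrt(p_hat * (1 - p_hat) / n)


def sample_size_for(margin, confidence=0.99, p_hat=0.5):
    """Smallest n whose Wald half-width is at most `margin`."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil(p_hat * (1 - p_hat) * (z / margin) ** 2)


print(wald_margin(2000, confidence=0.99))  # worst-case 99% margin for n = 2000
print(wald_margin(2000, confidence=0.95))  # worst-case 95% margin for n = 2000
print(sample_size_for(0.02, confidence=0.99))  # n needed for +/-2% at 99%
```

Under these assumptions, $n = 2000$ gives a margin of about 0.029 at 99% and about 0.022 at 95%, so the "+/-2 percent" figure is closer to the 95% level; reaching +/-2 percent at 99% in the worst case takes a little over 4000 interviews.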

BruceET
  • By appealing to a formula for a "Wald confidence interval" you basically are saying "trust us, this is the right behavior," but you don't actually answer the question, which seeks some intuition -- that is, real understanding -- for appreciating why the formula depends on $n^{-1/2}$ and not some other way. – whuber May 16 '22 at 12:48
  • "The standard error itself can be estimated by $\sqrt{\hat p(1-\hat p)/n}.$ [Figure.] Assuming that $Z = \frac{p - \hat p}{\sqrt{\hat p(1-\hat p)/n}}$ is approximately distributed as standard normal, one has $P(-1.96 < Z < 1.96) = 0.95\dots$" – BruceET May 16 '22 at 14:46