
Let's say I have a population $S$ with an estimated size $\hat{n}$ (and standard error $\sigma_{\hat{n}}$). I estimate $\hat{n}$ by drawing random samples from a larger sample space of size $m$ ($n \ll m$) and counting how many belong to $S$. For our purposes, $n$ can't realistically be determined any other way. Each sample is a Bernoulli trial (it either belongs to $S$ or doesn't), and we calculate $\sigma_{\hat{n}}$ through a normal approximation.

I'd like to sample from $S$ and determine how many samples belong to $T$, on the basis of some arbitrary criterion for $s \in S$. Let the observed proportion of $S$ in $T$ be called $\hat{p}$, and let's say we also use a normal approximation for it. My question is: how does $\sigma_{\hat{n}}$ "interact" with $\sigma_{\hat{p}}$, given that we want to calculate $\hat{n}\hat{p}$?

Some notes:

  • $\hat{n}$ and $\hat{p}$ are independent. There's no relationship between the two.
  • Let's say we're initially sampling from $R$ (of known size $m$) to find $\hat{n}$. Why not instead determine directly how many $r \in R$ are in $T$? The reason is that verifying that some $r$ or $s$ is in $T$ is very complex (PSPACE-hard). The maximum number of samples I can realistically verify to be in $T$ is so small that $m\hat{q}$ (where $\hat{q}$ is the observed proportion of $R$ in $T$) would have confidence intervals much too large to mean anything useful. So instead, I can obtain a very confident estimate $\hat{n}$ of $n$, and then sample from $S$.

Any guidance appreciated.

Potential answer: propagation of normally-distributed errors, in our case for the product of two estimates with standard errors $\sigma_1$ and $\sigma_2$: notes
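
For reference, a sketch of what that propagation gives, assuming $\hat{n}$ and $\hat{p}$ are independent and unbiased (writing $n$ and $p$ for the true values, which in practice would be replaced by the estimates):

$$\operatorname{Var}(\hat{n}\hat{p}) = n^2\sigma_{\hat{p}}^2 + p^2\sigma_{\hat{n}}^2 + \sigma_{\hat{n}}^2\sigma_{\hat{p}}^2$$

$$\left(\frac{\sigma_{\hat{n}\hat{p}}}{np}\right)^2 \approx \left(\frac{\sigma_{\hat{n}}}{n}\right)^2 + \left(\frac{\sigma_{\hat{p}}}{p}\right)^2$$

The second line drops the cross term $\sigma_{\hat{n}}^2\sigma_{\hat{p}}^2/(np)^2$, which is negligible when both relative errors are small. If $\hat{n}$ is estimated very precisely, the relative error of the product is roughly the relative error of $\hat{p}$.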

Other comments: I initially asked some pretty incomprehensible questions, and really shouldn't have been given the time. Thanks for everyone's precious time, especially BruceET and whuber's.

  • Your question is a bit vague; are you sampling to estimate a proportion, the mean of some quantity, or the population size? A very accessible place to start learning about sampling is the Penn State Stat Online Course STAT506, Sampling Theory and Methods. Good luck! – Mike Anderson Dec 27 '20 at 19:30
  • Why don't you know the actual sample size? What are you trying to find out? What do you mean by 'successful'? – BruceET Dec 27 '20 at 21:13
  • Hi Bruce, can you see the updated question? I don't know the sample size because it's not something I can easily calculate. I'm trying to find out the size of the subset of the population which is successful. And by successful, I mean that a member of the population passes the "success criteria". – Colin McDonagh Dec 27 '20 at 21:27
  • I guess this question will remain closed, but I guess in essence what I was asking is how do we multiply two confidence intervals: https://stats.stackexchange.com/questions/305382/how-do-i-calculate-the-confidence-interval-for-the-product-of-two-numbers-with – Colin McDonagh Dec 27 '20 at 22:09
  • That last comment helped me understand what you are trying to ask, so I would like to suggest that you consider editing the question to include a similar remark. It would help even more to provide more information about how $n$ and $\sigma_{\hat n}$ are estimated as well as about how you are able to obtain samples. Abstractly it's a strange situation and the description at least suggests the possibility that $\hat n$ and $\hat p$ are not independent, which may be an important consideration. – whuber Dec 28 '20 at 13:45
  • What I get is the following. You have some population with two properties: the size of the population $n$ and the fraction of successes in the population $p$. Your question is how to describe an estimate of the number of successes in the population, $pn$. What you have is an estimate $\hat{n}$ with some deviation (standard error?) $\sigma_{\hat{n}}$ and you have an estimate $\hat{p}$ based on a sample from the population. – Sextus Empiricus Dec 28 '20 at 21:16
  • This seems like you can approach this as the product of two variables, for which you can express the error of the product based on the errors of the individual terms. – Sextus Empiricus Dec 28 '20 at 21:19
  • Thanks for your time guys. My initial posts were totally incomprehensible which is unfair on you who give of your precious time freely. It still might be incomprehensible though, so I hold out. I've updated the question in response to the last three comments. Yeah Sextus, except that the variables are independent. – Colin McDonagh Dec 28 '20 at 22:55

2 Answers


An alternative approach is to sample from the large population $R$ of size $m > n$ until you have some fixed number of successes (samples in $T$).

For each sample, first test whether it is in $S$, and only if it is, test whether it is in $T$ (a success). This way you avoid running the costly test for $T$ on every sample.

The number of samples you need follows a negative binomial distribution. Based on that, you can estimate the probability $\hat{p}$ that a sample from $R$ is in both $S$ and $T$, and $\hat{p}m$ will be the estimate of the size.
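
A minimal simulation sketch of this stopping rule in R. The values `p_true`, `k`, and `m` are hypothetical and chosen only for illustration. The estimator $(k-1)/(N-1)$ is the standard unbiased estimator under inverse (negative binomial) sampling, used here in place of the simpler $k/N$.

```r
# Hypothetical values, for illustration only
set.seed(1)
p_true <- 1e-3   # assumed fraction of R that lies in T
k      <- 100    # stop after this many successes
m      <- 1e9    # assumed known size of R

# rnbinom() returns the number of failures before the k-th success
failures <- rnbinom(2000, size = k, prob = p_true)
N        <- failures + k           # total draws needed in each replicate

p_hat    <- (k - 1) / (N - 1)      # unbiased for p when k >= 2
size_hat <- m * p_hat              # estimated size of T in each replicate
rel_err  <- sd(p_hat) / p_true     # simulated relative standard error

# Compare with the delta-method approximation sqrt((1 - p) / k)
c(mean_p_hat = mean(p_hat), rel_err = rel_err,
  approx = sqrt((1 - p_true) / k))
```

Under these assumptions the relative standard error depends mainly on the number of successes $k$ and not on $p$, so about 100 successes would give roughly 10% relative error.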

  • Ok interesting, thanks Sextus. Would you mind checking my logic here if you have time? I agree that checking $r \in S$ before $r \in T$ makes sense. In my case, I expect $\hat{p} \approx 10^{-3}$. The standard error of a binomial proportion under a normal approximation is $(\frac{p(1-p)}{N})^{1/2} \approx (\frac{10^{-3}}{N})^{1/2}$. I'd like to have $\sigma_{\hat{p}} \leq \frac{\hat{p}}{10}$, which means that approximately I must have at most $(\frac{10^{-3}}{N})^{1/2} = 10^{-4}$ (ignoring the 1.96 multiplier in the case of 95% CI). – Colin McDonagh Dec 29 '20 at 00:20
  • In which case $N = 10^5$, but I think the greatest $N$ I can have is $10^3$. Maybe I'll have to have a think about how I can reduce the size of $m$, thus increasing $\hat{p}$. But then again, maybe the error propagation approach would be easier – Colin McDonagh Dec 29 '20 at 00:20
  • Ah, sorry. If I can rule out most samples on the basis that $r \notin S$, such that the probability of $r \in T$ given $r \in S$ is approximately $1$, then I only need to do the harder verification of $r \in T$ for $\frac{10^5}{10^3}$... which is definitely possible :) – Colin McDonagh Dec 29 '20 at 00:26
  • Thank you Sextus! – Colin McDonagh Dec 29 '20 at 00:37
0

Suppose you are sampling college admission test scores from a large high school district. Traditionally, the district mean on this test has been 280 with a standard deviation of 25. Then an approximate 95% CI for this year's district mean would be of the form $\bar X \pm 2(25)/\sqrt{n}.$ If you want this year's CI to have a margin of error of $\pm 10,$ then you have $2(25)/\sqrt{n} = 50/\sqrt{n} = 10.$ So you need a sample size of $n \approx 25.$

Suppose you sample $n = 25$ observations at random from $\mathsf{Norm}(\mu=280, \sigma=25),$ to get data x as below. (Sampling and computations in R.)

set.seed(2020)
x = rnorm(25, 289, 25)
summary(x);  length(x);  sd(x)
     Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    234.9   267.9   282.3   284.0   299.3   326.0 
[1] 25         # sample size
[1] 22.52312   # sample SD

The t.test procedure in R provides a 95% CI $(274.7, 293.3)$ as part of its output, extracted below as the conf.int component. The margin of error for this sample is about 9.3. [Margins of error will vary from sample to sample, depending on the sample standard deviation: for example, four additional samples of size $n=25$ gave margins of error 9.7, 12.9, 8.9, and 10.7.]

t.test(x)$conf.int
[1] 274.6643 293.2585
attr(,"conf.level")
[1] 0.95
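
As a check, the margin of error can be recomputed by hand from the sample SD above, using the 0.975 quantile of Student's t distribution with 24 degrees of freedom (about 2.064):

$$t_{0.975,\,24}\,\frac{s}{\sqrt{n}} \approx 2.064 \times \frac{22.52}{\sqrt{25}} \approx 9.30$$

This agrees with half the width of the interval printed by R, $(293.2585 - 274.6643)/2 \approx 9.30.$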
BruceET