
I have a large population of size $n$ drawn from an unknown continuous random variable $X$; I do not know the underlying distribution of $X$. Given a constant $c$, I want to determine the minimum sample size I need to estimate the probability $P(X \le c)$ for a given confidence level, $p_c$, and confidence interval, $I_c$ (I am not sure whether we need them!). How can I find this minimum sample size?

I found the following discussion on Wikipedia, which is independent of the population size. I am not sure whether it is a good way to determine the sample size!

[Image: excerpt from Wikipedia, not recovered]

I have also found some methods to determine the sample size for data to be analyzed by nonparametric tests. You don't have to make any assumption about the distribution of the values; that is why they are called nonparametric. Now I am confused about whether these nonparametric methods can be used to solve my problem, whether the method I found on Wikipedia is the correct way to solve it, or whether a better solution exists.

Thanks for your help.

Alex
  • 163
  • 2
    There is a substantial difference between the answer to this question for a single value of $c$ and an answer that is valid for more than one value. Which application do you have in mind? – whuber Oct 13 '13 at 17:27
  • a single value of c. I edited the question. – Alex Oct 13 '13 at 17:28
  • 3
    OK, that's easy. For the record, the solution for an arbitrary number of unspecified $c$ is given at http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test#Setting_confidence_limits_for_the_shape_of_a_distribution_function. – whuber Oct 13 '13 at 17:30
  • Are you sure you want $P(X\leq c)$ and not a statistic of your sample such as $P(\bar{X}\leq c)$? – bdeonovic Oct 13 '13 at 17:48
  • Yes, I want to find $P(X \le c)$. – Alex Oct 13 '13 at 17:54
  • Count the sample values $x_i$ such that $x_i \leq c$ and consider a confidence interval for the binomial proportion $\theta=\Pr(X \leq c)$. – Stéphane Laurent Oct 13 '13 at 18:16
  • 1
    @StéphaneLaurent I am a little confused. My question is how many samples, i.e. $x_i$'s, I should choose. What is the binomial proportion? Would you please explain a little more? – Alex Oct 13 '13 at 18:43
  • @whuber Thanks for your help. Knowing the value $c$, how can I find the required number of samples? – Alex Oct 13 '13 at 18:45
  • What is your criterion? What I had in mind is, for example, to find $n$ such that the length of the confidence interval is below a prespecified maximal length. – Stéphane Laurent Oct 14 '13 at 19:09
  • 1
    About your update: yes, this is what I said, if you count the sample values $x_i$ such that $x_i \leq c$ then this count has a binomial distribution. – Stéphane Laurent Oct 14 '13 at 20:14
  • 1
    I suggest you take a look at inverse binomial sampling. This is a sequential method that adaptively selects sample size to guarantee a certain confidence level for a prescribed relative confidence interval. So, for example, this method can assure that the estimated probability does not deviate from the true probability by more than, say, 10% with 95% confidence. Take a look at an explanation here (see especially the last reference): http://stats.stackexchange.com/questions/71164/monte-carlo-estimation-of-probabilities/71228#71228 – Luis Mendo Oct 14 '13 at 20:55

1 Answer


The Dvoretzky-Kiefer-Wolfowitz inequality can be used here. The required sample size $b$ (I'm using $b$ to distinguish it from $n$, because you already set your population size as $n$ in the problem statement) is determined by $$b \geq \left( {1 \over 2 \epsilon^2 } \right) \ln \left( {2 \over \alpha} \right),$$ where $\epsilon$ is how close you want your empirical cdf to be to the true cdf and $1-\alpha$ is the confidence level.

So, for example, if you want to estimate $F(c)$ within $\epsilon = 0.01$ with 95% confidence, the formula gives a sample size of $$b \geq 18444.4,$$ or $b = 18445.$
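The arithmetic in this example can be checked with a few lines of Python. This is only a sketch of the formula as stated; the function name `dkw_sample_size` is made up for illustration and is not from any library.

```python
import math


def dkw_sample_size(epsilon: float, alpha: float) -> int:
    """Smallest integer b with b >= ln(2 / alpha) / (2 * epsilon**2)."""
    if not (0.0 < epsilon < 1.0) or not (0.0 < alpha < 1.0):
        raise ValueError("epsilon and alpha must both lie strictly between 0 and 1")
    # Evaluate the bound and round up to the next whole sample.
    return math.ceil(math.log(2.0 / alpha) / (2.0 * epsilon ** 2))


print(dkw_sample_size(0.01, 0.05))  # 18445
```

Note that the population size $n$ does not appear anywhere: the bound depends only on $\epsilon$ and $\alpha$.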

This will cover any and all $c,$ so it is possible you can do much better. Perhaps one of the commenters will fill in the details on a more efficient solution for a single value of $c.$
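As a rough illustration of what a single-$c$ calculation might look like, the comments under the question suggest treating the number of sample values with $x_i \le c$ as binomial. The sketch below is not part of this answer's derivation: it assumes the usual normal approximation to the binomial proportion, with a planning guess `p_guess` for $P(X \le c)$ (0.5 is the worst case). The function name and parameters are hypothetical.

```python
import math
from statistics import NormalDist


def binomial_sample_size(epsilon: float, alpha: float, p_guess: float = 0.5) -> int:
    """Approximate sample size so that a normal-approximation interval for a
    binomial proportion has half-width at most epsilon at level 1 - alpha.

    Assumption: p_guess is a planning value for P(X <= c); 0.5 is the
    most conservative choice because it maximises p * (1 - p).
    """
    # Two-sided standard normal quantile, e.g. about 1.96 for alpha = 0.05.
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return math.ceil(z * z * p_guess * (1.0 - p_guess) / epsilon ** 2)


print(binomial_sample_size(0.01, 0.05))  # 9604
```

Under these assumptions the worst-case figure is about half of the 18445 above, but it relies on an approximation, whereas the Dvoretzky-Kiefer-Wolfowitz bound holds for every sample size.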

soakley
  • 4,516
  • 1
    Thanks a lot! Would you please introduce a reference for this inequality? – Alex Oct 14 '13 at 20:04
  • Thanks for the reference, but I cannot find α in the formula. – Alex Oct 15 '13 at 15:11
  • 1
    The 3rd equation on the reference page says that with a sample of size $b$ the probability that you will be more than $\epsilon$ away from the true cdf is $\leq 2e^{-2b \epsilon^2}.$ For your application just set this probability to be $\leq \alpha$ (where your $\alpha$ will be very small, since you want to be highly likely to be close to the true cdf) and then solve for $b.$ – soakley Oct 15 '13 at 18:46
  • Thanks. Sorry, I am not good at statistics. I have another question: can we say that $\epsilon$ is the confidence interval? – Alex Oct 16 '13 at 08:47
  • 1
    Say it this way: $\epsilon$ is the half-width of the confidence interval. Since your empirical cdf will be within $\epsilon$ of the true cdf with probability at least $1 - \alpha,$ the confidence interval has width $2 \epsilon.$ Here is another way to look at it: If $G(c)$ is your empirical cdf at $c,$ the confidence interval is $[G(c) - \epsilon, G(c) + \epsilon].$ – soakley Oct 16 '13 at 13:34
  • I have another simple question, and I would appreciate it if you could help me. Suppose that I know the sample size $b$ and I want to select $b$ samples from a large population of size $n$. Should this selection be random, or is there a better way to do this? – Alex Dec 23 '13 at 17:19
  • What is your purpose? If the sampling is not random, how will it be done? – soakley Dec 26 '13 at 18:18
  • I found different methods for sampling http://en.wikipedia.org/wiki/Sampling_(statistics)#Sampling_methods – Alex Dec 26 '13 at 19:35
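The interval $[G(c) - \epsilon, G(c) + \epsilon]$ described in the comments can be sketched in Python as follows. This is an illustration only: it assumes simple random sampling without replacement, treats the large population as effectively infinite so that the answer's bound applies, and clips the interval to $[0, 1]$. The function name and the `seed` parameter are hypothetical.

```python
import math
import random


def empirical_cdf_interval(population, b, c, alpha, seed=None):
    """Return (G(c), lower, upper) from a simple random sample of size b.

    Assumptions: `population` is a list of numbers much longer than b, and
    epsilon is obtained by solving 2 * exp(-2 * b * epsilon**2) = alpha.
    """
    rng = random.Random(seed)
    # Simple random sample without replacement.
    sample = rng.sample(population, b)
    # Empirical cdf at c: fraction of sampled values that are <= c.
    g_c = sum(1 for x in sample if x <= c) / b
    # Half-width of the interval at confidence level 1 - alpha.
    epsilon = math.sqrt(math.log(2.0 / alpha) / (2.0 * b))
    return g_c, max(0.0, g_c - epsilon), min(1.0, g_c + epsilon)


if __name__ == "__main__":
    # Made-up population purely for demonstration.
    demo_rng = random.Random(0)
    demo_population = [demo_rng.gauss(0.0, 1.0) for _ in range(200_000)]
    print(empirical_cdf_interval(demo_population, 18445, 0.0, 0.05, seed=1))
```

With $b = 18445$ and $\alpha = 0.05$ the half-width works out to just under 0.01, matching the example in the answer.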