14

I'm an absolute beginner in statistics. Please excuse any wrong assumptions or missing information in my question. I have a question that relates to a multinomial distribution (I'm not even 100% sure about this) that I hope somebody can help me with. Suppose I take a sample (let's assume $n=400$) of a categorical variable that has more than two possible outcomes (e.g. blue, black, green, yellow) and plot the frequencies so that I can derive the probabilities, e.g.: black 10%, blue 25%, green 35%, yellow 30%.

How could I compute the 95% confidence interval for each of those probabilities? And how could I determine the sample size required to get a result that is accurate to within ±3% for each probability? Please let me know if the answer to the question requires any additional information.

Andy
  • 19,098
Dirk
  • 311
  • 2
    Welcome to the website, you may want to do a search on maximum likelihood estimation and standard error, this link may be a good start. P.S: Although they are talking about a different distribution (Pareto) in the link, the concepts apply to your case. – Zhubarb Aug 10 '14 at 14:00
  • 1
    Also check this out: http://sites.stat.psu.edu/~sesa/stat504/Lecture/lec3_4up.pdf – Zhubarb Aug 10 '14 at 14:05
  • Would you know how to do it if you got only two categories instead of four? – Michael M Aug 10 '14 at 14:12
  • Hi Zhubarb, thanks a lot for the links I will read through them and try to follow the instructions. – Dirk Aug 10 '14 at 15:56
  • 1
    Hi Michael, I think in this case it could work with a binomial distribution and I would use the normal distribution (since it's approximately the same) to calculate the confidence interval. Please correct me if I'm wrong. – Dirk Aug 10 '14 at 16:06
  • 2
    Then you can simply do this for each category separately (e.g. black vs. non-black). – Michael M Aug 10 '14 at 17:04

3 Answers

7

Thank you very much again for your help. Below is the (hopefully correct) solution using the "Normal Approximation Method" of the Binomial Confidence Interval:

[Image: spreadsheet with the normal-approximation confidence intervals, not reproduced here]
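Since the spreadsheet itself is not available here, the following is a minimal Python sketch of the normal approximation for the numbers in the question. It treats each colour as "this category vs. the rest", as suggested in the comments. The counts (40, 100, 140, 120) are derived from the stated percentages of $n=400$, and the function name `normal_approx_ci` is made up for this illustration.

```python
import math

Z_95 = 1.96  # two-sided 95% critical value of the standard normal


def normal_approx_ci(count, n, z=Z_95):
    """Normal-approximation interval for one category vs. all others."""
    p_hat = count / n
    # The standard error uses the total sample size n, not the category count.
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se


n = 400
# Counts implied by the percentages in the question (assumption: exact).
counts = {"black": 40, "blue": 100, "green": 140, "yellow": 120}

intervals = {colour: normal_approx_ci(k, n) for colour, k in counts.items()}
for colour, (lo, hi) in intervals.items():
    print(f"{colour:>6}: {lo:.3f} to {hi:.3f}")
```

For black, for example, the standard error is $\sqrt{0.10 \cdot 0.90 / 400} = 0.015$, so the interval is roughly $0.10 \pm 0.029$.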

Dirk
  • 311
  • 1
    For the sample size in binomial designs, sometimes one uses the 'worst' case p = 0.5 because usually the proportions are not known in advance. Further note that there are slightly better methods than the simple z approximation, e.g. Wilson's method. But your solution looks nice anyway. – Michael M Aug 11 '14 at 07:29
  • Hi Michael, thanks a lot for the additional tips and the confirmation. Really appreciated. – Dirk Aug 11 '14 at 11:09
  • I think there might be a mistake in the equation in the spreadsheet. In the standard error (s.e.) you need to divide by the overall sample size (n=400), not the number of realizations for each category. –  Dec 26 '15 at 17:15
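The sample-size part of the question, together with the worst-case $p = 0.5$ tip from the comments, can be sketched as follows. This assumes that "within ±3%" means a half-width of 0.03 on the probability scale (three percentage points) and uses the normal approximation, solving $z\sqrt{p(1-p)/n} \le E$ for $n$. The function name `required_sample_size` is hypothetical.

```python
import math


def required_sample_size(margin, p=0.5, z=1.96):
    """Smallest n for which z * sqrt(p * (1 - p) / n) <= margin."""
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)


# Worst case: p = 0.5 maximises p * (1 - p), so this n is enough for any category.
n_worst = required_sample_size(0.03)

# If a proportion is roughly known in advance (here the 35% for green),
# a somewhat smaller sample is enough for that category.
n_green = required_sample_size(0.03, p=0.35)

print(n_worst, n_green)
```

Under these assumptions the worst case gives $n = 1068$, so the sample of 400 from the question is too small for a ±3% margin in every category.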
4

I would like to add Wilson's method mentioned by Michael M in a comment.
See Wikipedia: Binomial proportion confidence interval, section "Wilson score interval".
You can get a 95% confidence interval by using the following:

$\frac{n_s + \frac{z^2}{2}}{n+z^2} \pm \frac{z}{n+z^2}\sqrt{\frac{n_s n_f}{n}+\frac{z^2}{4}}$

The left term is the center value and the right term gives the value you have to add / subtract to get the interval bounds.

$n_s$ is the number of samples in that category, $n_f$ the number of samples not in that category, $n$ the total number of samples, and $z$ is 1.96 if you want a 95% confidence interval.*

For high counts it gives nearly the same results as the normal approximation, but it should behave better for low counts or extreme proportions.

As an example, I had a category with 0 samples, and the normal approximation returned an s.e. of 0 and thus an absurd confidence interval of 0-0 (as if it were certain that the category has 0% probability, when it actually had 0 occurrences only because of the small sample).
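A small Python sketch of the formula above, next to the normal approximation, may make the difference concrete. The function names are made up for this illustration, and the sample of 20 with an empty category is a hypothetical stand-in for the "few samples" case, not the actual data.

```python
from math import sqrt


def wilson_ci(n_s, n, z=1.96):
    """Wilson score interval, written exactly as in the formula above."""
    n_f = n - n_s  # samples not in the category
    centre = (n_s + z ** 2 / 2) / (n + z ** 2)
    half_width = z / (n + z ** 2) * sqrt(n_s * n_f / n + z ** 2 / 4)
    return centre - half_width, centre + half_width


def normal_ci(n_s, n, z=1.96):
    """Normal-approximation interval, for comparison."""
    p_hat = n_s / n
    se = sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se


# Black category from the question: 40 out of 400.
black = wilson_ci(40, 400)

# Hypothetical small sample in which a category never occurs.
empty_wilson = wilson_ci(0, 20)
empty_normal = normal_ci(0, 20)

print("black, Wilson:", black)
print("empty, Wilson:", empty_wilson)
print("empty, normal:", empty_normal)
```

For the empty category the normal approximation collapses to a zero-width interval, while the Wilson interval still has an upper bound of $z^2/(n+z^2)$, which is clearly above zero for small $n$.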


* The method is actually for a binomial distribution and $n_s$ and $n_f$ are successes and failures on that distribution. However, I think it can be reasonably used for a multinomial, even though it does not account for the fact that estimated probabilities must sum up to 1. The normal approximation doesn't account for that either afaik.

BlueCoder
  • 141
0

There are several methods to get confidence intervals for multinomial proportions, and many of them are implemented in R's function MultinomCI() from the DescTools package:

> DescTools::MultinomCI(400 * c(0.10, 0.25, 0.35, 0.30), sides = "two.sided", method = "sisonglaz")
      est lwr.ci    upr.ci
[1,] 0.10   0.05 0.1549989
[2,] 0.25   0.20 0.3049989
[3,] 0.35   0.30 0.4049989
[4,] 0.30   0.25 0.3549989

For details, see https://cran.r-project.org/web/packages/DescTools/DescTools.pdf
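For readers without R, the idea of simultaneous intervals can be sketched in plain Python. Note that this is not the Sison-Glaz method used above. It is a simpler, swapped-in technique: a Wilson interval per category with a Bonferroni-adjusted critical value, so that all $k$ intervals hold jointly with at least the stated confidence (under the usual approximation). The function name `bonferroni_wilson_ci` is hypothetical, and the results will differ from the `MultinomCI()` output.

```python
from math import sqrt
from statistics import NormalDist


def bonferroni_wilson_ci(counts, conf=0.95):
    """Wilson intervals per category with a Bonferroni-adjusted z value."""
    n = sum(counts)
    k = len(counts)
    alpha = 1 - conf
    # Split alpha over k two-sided intervals (Bonferroni correction).
    z = NormalDist().inv_cdf(1 - alpha / (2 * k))
    intervals = []
    for n_s in counts:
        n_f = n - n_s
        centre = (n_s + z ** 2 / 2) / (n + z ** 2)
        half_width = z / (n + z ** 2) * sqrt(n_s * n_f / n + z ** 2 / 4)
        intervals.append((centre - half_width, centre + half_width))
    return z, intervals


# Counts implied by the question: black, blue, green, yellow out of 400.
z_adj, simultaneous = bonferroni_wilson_ci([40, 100, 140, 120])

print(f"adjusted z = {z_adj:.3f}")
for lo, hi in simultaneous:
    print(f"{lo:.3f} to {hi:.3f}")
```

Because the adjusted critical value is larger than 1.96, each interval is wider than its per-category 95% counterpart. That is the price for a statement covering all four categories at once.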

schotti
  • 501