
I have a probability distribution whose tail follows a power law. While simulating samples and estimating parameters empirically, I noticed that the higher the percentile I try to measure, the more slowly its sample estimate converges. For instance, the median is approximated to within 2% after 100 samples, the 75th percentile requires about 500 samples, and the 95th percentile requires several thousand. I imagine there is a way to determine the distribution of the percentile error, and I tried to derive a formula using the methods of Newman (2005), but I am not getting anywhere on my own. Is there a known formula or method for this?

Reference: Newman, M. E. J. (2005). Power laws, Pareto distributions and Zipf’s law. Contemporary Physics, 46(5), 323–351. https://doi.org/10.1080/00107510500052444

gciriani
  • This can be (easily) analyzed using the methods shown for medians at https://stats.stackexchange.com/questions/45124. – whuber Mar 09 '22 at 16:30
  • @kjetil_b_halvorsen, which method? There are several, and it seems to me that most of them invoke the mean, which is undefined for my distribution (Cauchy or half-Cauchy). One method also states that for extreme percentiles the distribution of the percentile does not approach a normal distribution. Any help is appreciated. – gciriani Mar 09 '22 at 18:37
  • Please look more closely at the answers in the thread I linked to: they suffer from none of the problems you mention. – whuber Mar 09 '22 at 18:42
  • @whuber, thanks for prodding me. I get that the number of points below a percentile is binomially distributed, so the variance for quantile q is q(1-q)/n when expressed as a proportion. However, I do not understand why that quantity, in the first answer provided, is divided by the square of the value of the pdf at the quantile and not multiplied by it. Sorry to both you and kjetil_b_halvorsen for mixing up your ids in my first reply. – gciriani Mar 09 '22 at 20:45
  • When there are many highly upvoted answers to a question, it's useful to read them all. My answer in the link explains why it must be divided and not multiplied. The answer by Alecos Papadopoulos gives a mathematical derivation. – whuber Mar 09 '22 at 21:49
  • @whuber, It's a very involved explanation, but I think I now understand why one has to divide by the pdf, and not multiply, to obtain the standard deviation. I will answer my own question in a simplified way, if you prefer not to, and use your link to send contributors to your detailed explanation. Thanks. – gciriani Mar 09 '22 at 22:49
  • I agree my post is long and involved. However, the reasoning leading to division is given in full after the second figure, early on, and it's unnecessary to read beyond that. Begin at "Now consider a box with a more complicated shape." I also offer an abbreviated mathematical derivation at the very end in the "Asymptotic Results" section. – whuber Mar 10 '22 at 00:40

1 Answer


After several comments linking to detailed explanations for a separate question, here is a quick summary.

The number of sample points that fall below the true percentile at level $p$ follows a binomial distribution with parameters $n$ and $p$. Its variance is therefore $n p (1-p)$ and its standard deviation is $\sqrt{n p(1-p)}$. Expressed as a proportion of the sample, the standard deviation is $\sigma = \sqrt{p(1-p)/n}$. This proportion represents an area of variation under the probability density function (PDF), and the PDF evaluated at the location $x_p$ of the percentile, $PDF(x_p)$, is the height of that area. For a narrow strip, area is approximately height times width, so the width of that area is $\sigma / PDF(x_p)$, which is the approximate (large-sample) standard deviation of the position of the percentile.
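The area argument above can be written as a short worked equation. This is only a sketch of the large-sample approximation; the symbols $F$ (the CDF), $f$ (the PDF), $x_p$ (the true location of the percentile at level $p$) and $\hat{x}_p$ (its sample estimate) are notation I am introducing here, not taken from the linked answers.

$$
F(x_p + \delta) \approx p + f(x_p)\,\delta
\quad\Longrightarrow\quad
\delta \approx \frac{\Delta p}{f(x_p)},
\qquad
\operatorname{sd}(\Delta p) = \sqrt{\frac{p(1-p)}{n}}
\quad\Longrightarrow\quad
\operatorname{sd}(\hat{x}_p) \approx \frac{1}{f(x_p)}\sqrt{\frac{p(1-p)}{n}}
$$

Here $\Delta p$ is the error in the proportion of points below $x_p$ and $\delta$ is the resulting shift in position. A small density $f(x_p)$, as in a heavy tail, inflates the positional error, which is why the division (and not multiplication) makes high percentiles converge slowly.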

As noted in one of the answers linked above, for extreme percentiles the binomial distribution departs significantly from a normal distribution. Keeping that in mind, it is not too difficult to calculate asymmetric confidence intervals directly from the binomial distribution.
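One way to obtain such asymmetric intervals is the standard distribution-free construction from order statistics, sketched below. This is my own minimal illustration, not code from the linked answers: the function names (`binom_pmf`, `quantile_ci_ranks`, `quantile_ci`) are hypothetical, the sample is assumed to come from a continuous distribution, and the simulated standard Cauchy sample is only there to exercise the code.

```python
import math
import random

def binom_pmf(k, n, p):
    """P(B = k) for B ~ Binomial(n, p), computed in log space to avoid overflow."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log1p(-p))
    return math.exp(log_pmf)

def quantile_ci_ranks(n, p, conf=0.95):
    """Return 1-based ranks (l, u) and the coverage P(x_(l) <= x_p < x_(u)).

    B is the number of sample points at or below the true percentile x_p,
    so the coverage equals P(l <= B <= u - 1) and is at least `conf`.
    """
    alpha = 1.0 - conf
    cum, total = [], 0.0
    for k in range(n + 1):
        total += binom_pmf(k, n, p)
        cum.append(total)  # cum[k] = P(B <= k)
    lower = [r for r in range(1, n + 1) if cum[r - 1] <= alpha / 2]
    upper = [r for r in range(1, n + 1) if cum[r - 1] >= 1 - alpha / 2]
    if not lower or not upper:
        raise ValueError("sample too small for this percentile and confidence level")
    l, u = max(lower), min(upper)
    return l, u, cum[u - 1] - cum[l - 1]

def quantile_ci(sample, p, conf=0.95):
    """Confidence interval for the percentile at level p, read off the sorted sample."""
    xs = sorted(sample)
    l, u, coverage = quantile_ci_ranks(len(xs), p, conf)
    return xs[l - 1], xs[u - 1], coverage

# Illustration on a simulated standard Cauchy sample (inverse-CDF sampling).
random.seed(0)
sample = [math.tan(math.pi * (random.random() - 0.5)) for _ in range(200)]
lo, hi, coverage = quantile_ci(sample, 0.95)
print(f"95th percentile interval: [{lo:.2f}, {hi:.2f}], coverage {coverage:.3f}")
```

The ranks depend only on $n$ and $p$, not on the distribution, so the interval needs no estimate of the PDF. It is typically asymmetric around the sample percentile, especially in a heavy tail where the upper order statistics are far apart.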

In particular, the distribution I was dealing with (Cauchy) has tails asymptotic to a power law (the tail probability decays with exponent $-1$). To obtain approximately the same standard deviation of 0.09, roughly 300 points are required for the median ($0\pm0.09$), roughly 900 points for the 75th percentile ($1\pm0.09$), and roughly 90,000 points for the 95th percentile ($6.3\pm0.09$).
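These sample sizes can be checked by solving the standard-deviation formula for $n$. The sketch below assumes a standard Cauchy distribution (location 0, scale 1) and uses helper names of my own (`cauchy_quantile`, `cauchy_pdf`, `required_n`). It relies on the large-sample approximation only, so the results are indicative rather than exact.

```python
import math

def cauchy_quantile(p):
    """Location x_p of the percentile at level p for a standard Cauchy."""
    return math.tan(math.pi * (p - 0.5))

def cauchy_pdf(x):
    """Density of a standard Cauchy at x."""
    return 1.0 / (math.pi * (1.0 + x * x))

def required_n(p, target_sd):
    """Sample size n at which sqrt(p(1-p)/n) / pdf(x_p) equals target_sd."""
    x_p = cauchy_quantile(p)
    return p * (1.0 - p) / (cauchy_pdf(x_p) * target_sd) ** 2

for p in (0.5, 0.75, 0.95):
    x_p = cauchy_quantile(p)
    n = required_n(p, 0.09)
    print(f"p = {p:.2f}: x_p = {x_p:.2f}, n is about {n:,.0f}")
```

Under these assumptions the formula lands close to the 300 and 900 quoted for the median and the 75th percentile, and gives a figure of the same order, somewhat under 100,000, for the 95th percentile.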

I would like to thank whuber for guiding me to the right answer.

gciriani