15

Is there any technical trick to determine the third quartile if it belongs to an open interval which contains more than one fourth of the population (so I can't close the interval and use the standard formula)?

Edit

In case I misunderstood something, I will provide more or less full context. I have data arranged in a table with two columns and, say, 6 rows. To each row correspond an interval (in the first column) and the quantity of population that "belongs" to that interval (in the second column). The last interval is open and includes more than 25% of the population. All intervals (with the exception of the last) have the same width.

Sample data (transposed for presentation):

Column 1: (6;8),(8;10),(10;12),(12;14),(14;16),(16;∞)
Column 2:    51,    65,     68,     82,     78,   182 

The first column is to be interpreted as an income level range. The second is to be interpreted as the number of employees whose income belongs to the interval.

The standard formula I'm thinking about is $\mathbb{Q}_{3}=x_{Q_{3}}+ \frac{\frac{3N}{4}- \sum_{i=1}^{k-1}n_{i}}{n_{Q_{3}}}r_{Q_{3}}$.
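
(Here $x_{Q_3}$ is the lower endpoint of the bin containing the quartile, $n_{Q_3}$ its count, and $r_{Q_3}$ its width. For the sample data the formula cannot be applied: $N = 526$, so $3N/4 = 394.5$, while the cumulative count through $(14;16)$ is only $51+65+68+82+78 = 344$. The third quartile therefore falls in the open interval $(16;\infty)$, whose width $r_{Q_3}$ is undefined.)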

atad
  • 153
  • 1
    A common assumption when trying to estimate quantiles with binned data is to assume uniformity within bins. But when you know something about the way the data is likely to be distributed (as with incomes, which are right skew) assumptions that reflect that knowledge will tend to be better. Another alternative would be to assume that it's smooth, and then smooth the data (whether by KDE or some fitted distribution), redistribute points within bins according to the model [& possibly re-estimate (in somewhat EM-like fashion) the fit, &redistribute in bins again] then estimate quantiles from that. – Glen_b Apr 27 '14 at 03:48
  • Hi @Glen_b, under the assumption of uniformity within bins: do you think I can add a uniform jitter (within the bin width) and then run a Shapiro-Wilk test for normality? – frhack Jun 24 '22 at 09:05
  • I wouldn't; you don't have it in that situation, since a normal distribution is not piecewise uniform. Why test a null that is certainly false? Presumably you intend to find out something else instead of whether something constructed to be non-normal is exactly normal. – Glen_b Jun 24 '22 at 17:22

2 Answers

20

You need to fit these binned data with some distributional model, for that is the only way to extrapolate into the upper quartile.

A model

By definition, such a model is given by a cadlag function $F$ rising from $0$ to $1$. The probability it assigns to any interval $(a,b]$ is $F(b)-F(a)$. To make the fit, you need to posit a family of possible functions indexed by a (vector) parameter $\theta$, $\{F_\theta\}$. Assuming that the sample summarizes a collection of people chosen randomly and independently from a population described by some specific (but unknown) $F_\theta$, the probability of the sample (or likelihood, $L$) is the product of the individual probabilities. In the example, it would equal

$$L(\theta) = (F_\theta(8) - F_\theta(6))^{51} (F_\theta(10) - F_\theta(8))^{65} \cdots (F_\theta(\infty) - F_\theta(16))^{182}$$

because $51$ of the people have associated probabilities $F_\theta(8) - F_\theta(6)$, $65$ have probabilities $F_\theta(10) - F_\theta(8)$, and so on.

Fitting the model to the data

The Maximum Likelihood estimate of $\theta$ is a value which maximizes $L$ (or, equivalently, the logarithm of $L$).

Income distributions are often modeled by lognormal distributions (see, for example, http://gdrs.sourceforge.net/docs/PoleStar_TechNote_4.pdf). Writing $\theta = (\mu,\sigma)$, the family of lognormal distributions is

$$F_{(\mu, \sigma)}(x) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{(\log(x)-\mu)/\sigma} \exp(-t^2/2) dt.$$
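
As an aside (not part of the original answer), this family is available in R as the built-in lognormal CDF plnorm, which agrees with the pnorm(log(x), ...) form used in the code below:

x <- 12; mu <- 2.6; sigma <- 0.38  # arbitrary illustrative values
plnorm(x, meanlog=mu, sdlog=sigma)
pnorm(log(x), mean=mu, sd=sigma)   # identical result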

For this family (and many others) it is straightforward to optimize $L$ numerically. For instance, in R we would write a function to compute $\log(L(\theta))$ and then optimize it, because the maximum of $\log(L)$ coincides with the maximum of $L$ itself and (usually) $\log(L)$ is simpler to calculate and numerically more stable to work with:

logL <- function(thresh, pop, mu, sigma) {
  # Probability the model assigns to the bin (x1, x2]; an NA upper
  # threshold marks the open-ended top bin.
  l <- function(x1, x2) {
    ifelse(is.na(x2), 1, pnorm(log(x2), mean=mu, sd=sigma)) -
      pnorm(log(x1), mean=mu, sd=sigma)
  }
  # A bin's contribution to log(L): its count times its log probability.
  logl <- function(n, x1, x2) n * log(l(x1, x2))
  sum(mapply(logl, pop, thresh, c(thresh[-1], NA)))
}

thresh <- c(6,8,10,12,14,16)
pop <- c(51,65,68,82,78,182)
fit <- optim(c(0,1), function(theta) -logL(thresh, pop, theta[1], theta[2]))

The solution in this example is $\theta = (\mu,\sigma)=(2.620945, 0.379682)$, found in the value fit$par.

Checking model assumptions

We need at least to check how well this conforms to the assumed lognormality, so we write a function that uses $F$ to compute the fitted bin counts:

predict <- function(a, b, mu, sigma, n) {
  # Expected count in the bin (a, b] out of n people; an NA upper
  # limit marks the open-ended top bin.
  n * ( ifelse(is.na(b), 1, pnorm(log(b), mean=mu, sd=sigma)) 
        - pnorm(log(a), mean=mu, sd=sigma) )
}

It is applied to the data to obtain the fitted or "predicted" bin populations:

pred <- mapply(function(a,b) predict(a,b,fit$par[1], fit$par[2], sum(pop)), 
               thresh, c(thresh[-1], NA))

We can draw histograms of the data and the prediction to compare them visually, shown in the first row of these plots:

[Figure: histograms comparing the observed bin populations with the fitted ones; the first row shows the fit above, the bottom row the exploratory refit described below.]
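
A minimal base-R sketch of such a comparison plot (not the original plotting code; the bin labels are ad hoc):

labels <- c("6-8","8-10","10-12","12-14","14-16","16+")
barplot(rbind(Data=pop, Fitted=pred), beside=TRUE, names.arg=labels,
        legend.text=TRUE, xlab="Income bin", ylab="Number of employees")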

To compare them, we can compute a chi-squared statistic. This is usually referred to a chi-squared distribution to assess significance:

chisq <- sum((pred-pop)^2 / pred)
df <- length(pop) - 2 - 1  # k bins, minus 2 fitted parameters, minus 1
pchisq(chisq, df, lower.tail=FALSE)

The "p-value" of $0.0035$ is small enough to make many people feel the fit isn't good. Looking at the plots, the problem evidently focuses in the lowest $6-8$ bin. Perhaps the lower terminus should have been zero? If, in an exploratory fashion, we were to reduce the $6$ to anything less than $3$, we would obtain the fit shown in the bottom row of plots. The chi-squared p-value is now $0.26$, indicating (hypothetically, because we're purely in an exploratory mode now) that this statistic finds no significant difference between the data and the fit.

Using the fit to estimate quantiles

If we accept, then, that (1) the incomes are approximately lognormally distributed and (2) the lower limit of the incomes is less than $6$ (say $3$), then the maximum likelihood estimate is $(\mu, \sigma)$ = $(2.620334, 0.405454)$. Using these parameters we can invert $F$ to obtain the $75^{\text{th}}$ percentile:

# Invert the fitted lognormal CDF at probability 0.75
exp(qnorm(.75, mean=fit$par[1], sd=fit$par[2]))

The value is $18.06$. (Had we not changed the lower limit of the first bin from $6$ to $3$, we would have obtained instead $17.76$.)

These procedures and this code can be applied in general. The theory of maximum likelihood can be further exploited to compute a confidence interval around the third quartile, if that is of interest.
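
For instance (a sketch only, not part of the original answer): re-running optim with hessian=TRUE yields the observed information matrix, and the delta method then gives an approximate standard error for the estimated quartile $\exp(\mu + z_{0.75}\,\sigma)$:

# Approximate 95% CI for Q3 via the delta method; asymptotic normality
# of the maximum likelihood estimate is assumed.
fit <- optim(c(0,1), function(theta) -logL(thresh, pop, theta[1], theta[2]),
             hessian=TRUE)
V <- solve(fit$hessian)               # asymptotic covariance of (mu, sigma)
z <- qnorm(0.75)
q3 <- exp(fit$par[1] + z*fit$par[2])  # fitted third quartile
grad <- c(q3, z*q3)                   # gradient of exp(mu + z*sigma)
se <- sqrt(drop(t(grad) %*% V %*% grad))
q3 + c(-1, 1) * 1.96 * se             # approximate 95% confidence interval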

whuber
  • 322,774
  • Wow, thanks! I must admit I didn't expect such advanced (at least for me) machinery to be used to find the solution. – atad Aug 22 '12 at 21:09
  • The machinery does not have to be advanced or sophisticated, but whatever you do ought to follow the same general lines of this example: assume something about the income distribution, use that to fit a mathematical model, check the model for reasonableness, and if it's a reasonable fit, use it to compute the quartile. Along the way, use graphical methods because they can reveal interesting patterns. (Here, the interest is that there is an apparent departure from lognormality in the low income bracket: I would wonder why that occurs and what it might say about this population.) – whuber Aug 22 '12 at 21:13
  • +1, great answer. Looks like I'm going to have to learn R yet. – dav Apr 25 '14 at 17:01
  • why using pchisq (like above) or chisq.test (simulated for goodness-of-fit) gives different results ? Aren't the same probabilities? Which is better to use ? – frhack Jun 30 '22 at 06:39
  • @frhack I don't understand what difference you are pointing to. If chisq.test gives a different result, I would suspect it hasn't been applied correctly. – whuber Jun 30 '22 at 12:07
  • it gives a different pvalue – frhack Jun 30 '22 at 13:03
  • @frhack Yes, that's what you're saying. I am suggesting it does so because it has been misapplied. In particular, if all you are doing is passing a vector of bin counts and estimated counts to chisq.test, it has no way of knowing you estimated two parameters and so it will use the wrong value for the degrees of freedom parameter. – whuber Jun 30 '22 at 13:06
  • @whuber Yes, there is something wrong in my computation but I can't find what: chisq.test(pop, p=pred/sum(pred), simulate.p.value=F, rescale.p=F) – frhack Jun 30 '22 at 13:26
  • @whuber any hint ? – frhack Jun 30 '22 at 13:56
  • 1
    @frhack I already explained the problem. That's why my code in this answer has to compute the p-value itself rather than employing chisq.test. Asking chisq.test to simulate doesn't help, because it still doesn't account for your estimation of two parameters. To see intuitively why that's a problem, consider the extreme case where you employ so many parameters that you can fit each bin exactly. Your chi-squared statistic will be zero and chisq.test will inevitably give a tiny p-value, which evidently is meaningless. – whuber Jun 30 '22 at 14:35
  • Sorry... I'm almost there... But now I don't understand why (having two parameters) df <- length(pop) - 2 and not df <- length(pop) - 3 – frhack Jun 30 '22 at 17:01
  • 1
    @frhack Good catch! That's an error on my part. I will correct the affected portions of this post. (It doesn't change any of the conclusions, but it's always good to get the numbers right.) – whuber Jun 30 '22 at 17:06
  • @whuber so in this case the standard formula k - p - 1 holds? – frhack Jun 30 '22 at 17:09
  • 1
    @frhack Yes. I changed the code to reflect that explicitly. – whuber Jun 30 '22 at 17:35

9

Too long for a comment:

whuber's answer is as good as any, but he does assume right-skewness in his log-normal model. This may be realistic for incomes over a general population, but may not be for incomes for a single employer at a particular grade.

You could alternatively choose to model the distribution as roughly symmetric, in which case you might put about $68$ into the range 16-18, $64$ into 18-20, and $50$ into the range 20-22, and this would give you a third quartile estimate of around $17.5$.

You would have a lower estimate if you chose instead to continue the frequency at about $80$ per two-unit interval, which would give you a third quartile estimate of around $17.3$.
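
Both figures follow from the grouped-data interpolation formula in the question: with $N = 526$, the target is $3N/4 = 394.5$, and the cumulative count below $16$ is $344$, so

$$Q_3 \approx 16 + \frac{394.5 - 344}{68}\times 2 \approx 17.5 \qquad\text{or}\qquad Q_3 \approx 16 + \frac{394.5 - 344}{80}\times 2 \approx 17.3.$$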

Higher estimates are possible with other assumptions. So my conclusion would be that the third quartile point is likely to be above $17$, but that you really do not have enough data to make an accurate estimate without knowing (or assuming) more about the distribution of income at the top end, and that is precisely what you do not know.

Henry
  • 39,459
  • 2
    (+1) Thank you for emphasizing (and analyzing) the dependence of the answer on the model assumptions. If (in the example) you cannot assume anything, then all you can say is that the third quartile exceeds $16$. If you do assume a model, then at least you can say to the consumer of your advice, "if your picture of the income distribution is at least roughly what I have assumed, then you can use my result as a reasonable estimate of the third quartile." (Most statistical conclusions are implicitly conditionals of this sort.) – whuber Aug 23 '12 at 18:23