
Equal-frequency binning divides the data set into bins that all have the same number of samples. Quantile binning assigns the same number of observations to each bin. What is the difference between the two methods? It seems to me that both do the same thing and that it is just a matter of terminology. Unfortunately, I could not find a clear answer.


joni
  • I don't know where the definitions come from, but to me the term quantile binning does not rule out using any quantiles you like as breakpoints, and so they don't have to be equally spaced. BTW, number of samples is terminology that makes sense to anyone dealing with say chunks or bits of water, sediment, tissue, whatever, but many statistical people would be more likely to say sample size (for the entire sample in hand) or number of observations for the sample or a subset of it. – Nick Cox Feb 03 '21 at 10:10

1 Answer


For data that are continuous and have not been rounded so much that there are ties, both methods can be used to make intervals with equal counts, provided that the number of intervals divides the number of observations evenly.
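To see why divisibility matters, here is a minimal sketch with a made-up sample of 83 observations split at the deciles. The seed, the sample size, and the names `y`, `cuts`, and `counts` are my own choices for illustration and are not part of the example that follows.

```r
# Hypothetical illustration: 83 observations cannot be split into 10 equal bins
set.seed(1)                        # arbitrary seed, chosen only for this sketch
y <- rnorm(83, 50, 7)              # continuous and unrounded, so no ties
cuts <- quantile(y, seq(0, 1, by = 0.1))   # deciles as bin boundaries
counts <- table(cut(y, breaks = cuts, include.lowest = TRUE))
counts                             # counts are close to 8.3 but cannot all be equal
```

With R's default quantile definition, the bins should hold either 8 or 9 observations each, so the counts are as equal as possible but not identical.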

Suppose I have $n = 80$ observations, and I want ten bins with 8 observations in each. [I'm using R to sample 80 observations from the distribution $\mathsf{Norm}(\mu=50,\sigma=7)$.] Then I could use deciles to make the cutpoints separating intervals.

set.seed(202)   # use this seed to get exactly same data again
x = round(rnorm(80, 50, 7), 3)   # round to three decimal places

I set the width of the print window so that exactly eight (sorted) observations will print on each line below. Numbers in [ ]s show the index (number in order) of the first observation on each line. This is not necessary, but may make it easier to follow through the process.

sort(x)
 [1] 33.491 35.961 36.467 37.278 37.752 39.986 40.121 42.078
 [9] 42.154 42.384 43.157 43.723 43.989 44.071 44.500 44.564
[17] 44.569 44.823 45.205 45.363 45.372 45.851 46.071 46.917
[25] 46.987 47.488 47.645 48.192 48.404 48.978 49.031 49.238
[33] 49.249 49.251 49.304 49.336 49.366 50.137 50.200 50.526
[41] 50.587 50.631 50.968 51.597 51.662 51.762 51.768 52.191
[49] 52.322 52.376 52.513 52.928 53.257 53.281 53.471 53.556
[57] 53.587 53.617 54.041 54.179 54.958 55.249 56.167 56.325
[65] 56.346 56.432 56.606 56.702 57.143 57.606 58.259 58.483
[73] 60.116 61.688 62.374 62.852 62.971 63.822 65.614 69.224

Check that there are no ties. (This is important because ties might force unequal counts among the ten intervals.)

length(unique(x))
[1] 80            # distinctly different values
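To illustrate how ties can spoil equal counts, here is a small sketch with a hand-made vector of 12 values, four of which are tied at 2, split at the quartiles. The vector `z` and the names `qcut` and `counts` are hypothetical and chosen only for this illustration.

```r
# Hypothetical data: 12 values, with the value 2 repeated four times
z <- c(1, 2, 2, 2, 2, 3, 4, 5, 6, 7, 8, 9)
qcut <- quantile(z, c(0, 0.25, 0.5, 0.75, 1))   # quartiles as bin boundaries
counts <- table(cut(z, breaks = qcut, include.lowest = TRUE))
counts   # ideally 3 per bin, but all the tied 2s must land in the same bin
```

Because tied values cannot be separated by any cutpoint, the first bin absorbs all four 2s and the counts come out unequal, even though 12 is divisible by 4.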

Choose deciles 0%, 10%, 20%, ..., 90%, 100% as the cut points (bin boundaries):

cutp = quantile(x, seq(0, 1, by=.1));  cutp
     0%     10%     20%     30%     40%     50%     60% 
33.4910 42.1464 44.5680 46.9660 49.2446 50.5565 52.2434 
    70%     80%     90%    100% 
53.5653 56.3292 58.6463 69.2240 

Notice that the deciles separate the ten lines of the printout of the 80 observations above.
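As a quick numerical check of the equal counts, one could assign each observation to its decile bin with cut and tabulate the result. This is only a sketch: it regenerates the same data so that it runs on its own, and the names `bins` and `counts` are my additions.

```r
set.seed(202)                               # same seed as above
x <- round(rnorm(80, 50, 7), 3)
cutp <- quantile(x, seq(0, 1, by = 0.1))    # decile cutpoints

# include.lowest = TRUE keeps the minimum, which sits exactly on the first cutpoint
bins <- cut(x, breaks = cutp, include.lowest = TRUE)
counts <- table(bins)
as.vector(counts)                           # number of observations in each bin
```

If the data reproduce as shown above (80 distinct values), each of the ten bins should contain 8 observations.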

Now I'll make a histogram using these cutpoints as the boundaries of the histogram bars. The function rug puts tick marks along the horizontal axis to show exact positions of data values.

hist(x, breaks=cutp, col="skyblue2");  rug(x)

[Figure: histogram of x with bin boundaries at the deciles, with rug marks along the horizontal axis]

Ordinarily, this is not a good way to make a pretty histogram. (However, for certain kinds of printed tables, it may be helpful to have the same number of observations in each interval.) I'm showing the histogram so that you can see the effect of forcing the number of counts in each interval to be the same. In the figure above each of the ten bars has exactly the same area.
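The equal-area property can also be checked without drawing anything. The sketch below regenerates the data so that it is self-contained; the names `h` and `areas` are mine. When the breaks are unequally spaced, hist reports densities, and the area of each bar is its density times its width.

```r
set.seed(202)                               # same seed as above
x <- round(rnorm(80, 50, 7), 3)
cutp <- quantile(x, seq(0, 1, by = 0.1))    # decile cutpoints

h <- hist(x, breaks = cutp, plot = FALSE)   # compute the histogram without plotting
areas <- h$density * diff(h$breaks)         # bar area = height * width
areas                                       # proportion of the data in each bar
```

With 8 of 80 observations in every bin, each area should be 0.1, which is why the narrow bars in the middle are tall and the wide bars in the tails are short.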

Usually, the software picks 'round' numbers for bin boundaries. But I wanted you to see that there are eight observations in each interval (histogram bin) in the figure above.

Here is a typical histogram drawn by R. There are different numbers of observations in each bin. (And the bars have correspondingly different areas.) In a 'density' histogram the total area of all the bars is $1.$

hist(x, probability=TRUE, col="skyblue2");  rug(x)

[Figure: default density histogram of x with equal-width bins, with rug marks along the horizontal axis]

In R, a non-plotting "histogram" displays information used to make the figure just above (for simplicity I have not shown the entire list).

hist(x, plot=FALSE)

$breaks
[1] 30 35 40 45 50 55 60 65 70

$counts
[1]  1  5 12 19 24 11  6  2

$mids
[1] 32.5 37.5 42.5 47.5 52.5 57.5 62.5 67.5

BruceET
  • Your answer explains quantile binning very well, but unfortunately it did not become clear to me whether there is a difference between quantile and equal-frequency binning or not. As far as I understood, quantile binning does not necessarily imply that the bins have the same number of observations. But if the total number of observations divided by the number of bins is not a whole number, this would also be the case for equal-frequency binning. – joni Feb 04 '21 at 12:24
  • Keep in mind that neither approach is consistent with mechanisms underlying the relationships. Shapes of relationships of X vs Y come from mechanisms and not from how many subjects in the sample are similar to a given subject. Also, binning has lots of problems. If binning, overlapping bins work better. – Frank Harrell Feb 04 '21 at 12:38
  • See my first paragraph. If there are no ties and if the number of observations is evenly divisible by the number of (disjoint) bins, then quantile binning ensures that bins have equal numbers of observations. – BruceET Feb 04 '21 at 17:43