I want to discretize a continuous pandas column. To choose the number of bins I'm using the Freedman-Diaconis rule, and I pass the result to KBinsDiscretizer. The rule states that

$$ \text{bin width},\ h = 2\,\frac{\operatorname{IQR}(x)}{n^{1/3}} $$

$$ \text{number of bins},\ k = \frac{\operatorname{range}(x)}{h} $$

The column has $32561$ values. After sorting, the first $29849$ elements are $0$, so $\operatorname{IQR}(x) = 0$ and therefore $h = 0$. Computing the number of bins then divides by zero. What can I do here?
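
A minimal way to reproduce the problem, using only the Python standard library. The list `x` is a hypothetical stand-in for the column (same length and same number of zeros, with made-up nonzero values), not the real data:

```python
import statistics

# Hypothetical stand-in: 32561 values, the first 29849 of them zero.
x = [0.0] * 29849 + [float(i) for i in range(1, 2713)]
n = len(x)

# Quartiles by linear interpolation between order statistics.
q1, _, q3 = statistics.quantiles(x, n=4, method="inclusive")
iqr = q3 - q1                 # both quartiles fall among the zeros
h = 2 * iqr / n ** (1 / 3)    # Freedman-Diaconis bin width

try:
    k = (max(x) - min(x)) / h
except ZeroDivisionError:
    k = None                  # the rule gives no usable bin count
```

With NumPy arrays the same division would typically give `inf` with a warning instead of raising.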

Robur_131

2 Answers

If you look at R's hist.default function, the nclass.FD function is what computes the number of bins for the Freedman-Diaconis rule. Here is the full code:

> nclass.FD
function (x) 
{
    h <- 2 * stats::IQR(x. <- signif(x, digits = 5))
    if (h == 0) {
        x. <- sort(x.)
        al <- 1/4
        al.min <- 1/512
        while (h == 0 && (al <- al/2) >= al.min)
            h <- diff(stats::quantile(x., c(al, 1 - al), names = FALSE))/(1 - 2 * al)
    }
    if (h == 0) 
        h <- 3.5 * sqrt(stats::var(x))
    if (h > 0) 
        ceiling(diff(range(x))/h * length(x)^(1/3))
    else 1L
}

So, as you can see, if h (the default calculation, twice the IQR) is 0, the function looks for a replacement based on a wider pair of quantiles, al and 1 - al. It starts from al = 1/4 and halves al before each attempt, so it first checks the gap between the 12.5% and 87.5% quantiles, then 6.25% and 93.75%, and so on, dividing each gap by (1 - 2*al). This continues down to al = 1/512. If h is still 0 at that point, it uses 3.5 times the standard deviation. If even that is 0 (all values identical), it returns a single bin.
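
Since the question is about pandas, here is a rough Python sketch of the same fallback logic. The names `quantile_linear` and `fd_bins` are made up for this illustration, the `signif(x, digits = 5)` rounding step is left out, and only the standard library is used, so treat it as an approximation of nclass.FD rather than a verified port:

```python
import math
import statistics


def quantile_linear(sorted_x, p):
    """Quantile by linear interpolation between order statistics
    (the default definition in R's quantile)."""
    pos = p * (len(sorted_x) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(sorted_x) - 1)
    return sorted_x[lo] + (pos - lo) * (sorted_x[hi] - sorted_x[lo])


def fd_bins(x):
    """Bin count following the fallback steps of R's nclass.FD
    (without the rounding to 5 significant digits)."""
    xs = sorted(x)
    n = len(xs)
    # Default: twice the interquartile range.
    h = 2 * (quantile_linear(xs, 0.75) - quantile_linear(xs, 0.25))
    # Fallback 1: widen the quantile pair (1/8, 1/16, ..., 1/512).
    al = 1 / 4
    while h == 0 and al / 2 >= 1 / 512:
        al /= 2
        gap = quantile_linear(xs, 1 - al) - quantile_linear(xs, al)
        h = gap / (1 - 2 * al)
    # Fallback 2: 3.5 times the sample standard deviation.
    if h == 0 and n > 1:
        h = 3.5 * statistics.stdev(xs)
    if h > 0:
        return math.ceil((xs[-1] - xs[0]) / h * n ** (1 / 3))
    return 1  # all values identical


# Mostly-zero data: the IQR is 0, so a wider quantile pair is used.
k = fd_bins([0] * 90 + list(range(1, 11)))
```

The resulting `k` could then be passed as `n_bins` to KBinsDiscretizer.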

Sam A.

If you really want to keep such a dataset in your collection, then you could use only the nonzero elements to decide on the number of bins.

Something like this demonstration in R:

Here are fake data, roughly like yours:

set.seed(322)
x1 = rnorm(10000, 100, 10)
x2 = rep(0, 20000)
x = c(x1, x2)

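Plot only the nonzero data to see how many bins they 'need'. Here the count is 17 (strictly, 17 break points, so 16 bins; the difference does not matter for this rough estimate). (The nonzero data span about $1/3$ of the total width of the histogram.)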

b = length(hist(x1, plot=F)$breaks); b
[1] 17

Show three histograms: (1) nonzero data with the default breaks found above; (2) all data with too few bins to show the detail of the nonzero data; (3) all data with about $3(17) = 51$ bins, which shows more detail for the nonzero data.

par(mfrow=c(1,3))   # enable 3 panels per plot
  hist(x1, col="skyblue2")        # 1
  hist(x, col="skyblue2")         # 2
  hist(x, br=3*b, col="skyblue2") # 3
par(mfrow=c(1,1))   # return to single-panel plots

I'm not claiming that R uses the Freedman-Diaconis rule. Couldn't immediately find out. Maybe someone else here knows.

side-by-side plots of three histograms
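
For a pandas/Python workflow, the same idea might be sketched as follows. This is only an illustration under assumptions: it uses the plain Freedman-Diaconis rule on the nonzero part (the R demonstration above relies on hist's default instead), the helper name `fd_bin_count` is made up, and the fake data are generated with Python's random module, so they will not match the R numbers:

```python
import math
import random
import statistics


def fd_bin_count(values):
    """Plain Freedman-Diaconis bin count; assumes a nonzero IQR."""
    n = len(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    h = 2 * (q3 - q1) / n ** (1 / 3)          # bin width
    return math.ceil((max(values) - min(values)) / h)


random.seed(322)
x1 = [random.gauss(100, 10) for _ in range(10_000)]  # nonzero part
x = x1 + [0.0] * 20_000                               # zero-inflated column

# Decide the bin count from the nonzero values only.
bins_nonzero = fd_bin_count(x1)

# Stretch that count so the same bin width covers the full range.
scale = (max(x) - min(x)) / (max(x1) - min(x1))
bins_all = math.ceil(bins_nonzero * scale)
```

Here `bins_all` plays the role of `3*b` in the R code: the bin width chosen for the nonzero data is kept, and the count grows in proportion to the wider range.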

BruceET