Suppose we have a continuous random variable $X$. We do not know its distribution function, but have $n$ i.i.d. samples.

I am looking for methods that quantize (discretize) $X$ into a categorical variable $\tilde{X}$ with $k$ levels.

The most naive method is uniform quantization, where every quantization bin has the same length. However, it seems more reasonable to set the bin lengths according to the empirical density of $X$, as shown in the figure below.

[Figure: image not available]
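For reference, uniform quantization as described above can be sketched as follows. This is only an illustration using the Python standard library; the function name `uniform_quantize` and the exponential toy data are assumptions made for the example, not part of the question.

```python
import random

def uniform_quantize(samples, k):
    """Map each sample to one of k equal-width bins spanning [min, max]."""
    lo, hi = min(samples), max(samples)
    width = (hi - lo) / k
    if width == 0:
        # All samples are identical, so everything goes into a single level.
        return [0] * len(samples)
    # The maximum would land in bin k, so clamp it into the last bin.
    return [min(int((x - lo) / width), k - 1) for x in samples]

random.seed(0)
samples = [random.expovariate(1.0) for _ in range(1000)]  # skewed toy data
levels = uniform_quantize(samples, k=4)
counts = [levels.count(j) for j in range(4)]
print(counts)  # per-level counts; very uneven for skewed data
```

With skewed data such as this, most samples end up in the first level, which is the motivation for density-aware bins.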

I think there must be some standard text that deals with my problem, but have not found one yet. I would be really grateful if you could point to an authoritative reference that covers this topic.

  • If you want to approximate the density, bins of equal length seem the most reasonable choice (check ?hist in R) – utobi Mar 27 '23 at 06:47
  • You can bin them by quantiles; each bin will then hold an equal share of the data. – user2974951 Mar 27 '23 at 08:00
  • Thanks, binning by quantiles works! One may use pandas.qcut for implementation. – Mingzhou Liu Mar 28 '23 at 03:57
  • Every textbook on cartography I have seen (and I have a bunch) discusses this. Standard methods are equal intervals, equal areas (not relevant here, but hinting at the possibility of exploiting additional variables for the binning), quantiles, and "Jenks' method," which is just k-means clustering in 1D. Of course there are many other methods -- infinitely many. Which of these might be legitimate depends (very much) on the circumstances and your objectives. – whuber Mar 30 '23 at 18:34
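To make the "k-means clustering in 1D" idea from the last comment concrete, here is a minimal sketch of a Lloyd-style iteration on one-dimensional data. It is a heuristic that may stop at a local optimum, so it should not be read as a reference implementation of Jenks' method. The function name `kmeans_1d_breaks`, the initialisation at evenly spaced order statistics, and the toy data are all assumptions made for illustration.

```python
import random

def kmeans_1d_breaks(samples, k, iters=100):
    """Return k-1 bin boundaries from a plain Lloyd-style k-means on 1D data."""
    xs = sorted(samples)
    n = len(xs)
    # Initialise centres at evenly spaced order statistics (an arbitrary choice).
    centres = [xs[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    for _ in range(iters):
        # In 1D each cluster is an interval whose boundaries are the
        # midpoints between adjacent centres.
        breaks = [(a + b) / 2 for a, b in zip(centres, centres[1:])]
        groups = [[] for _ in range(k)]
        for x in xs:
            j = sum(x > b for b in breaks)  # index of the interval containing x
            groups[j].append(x)
        # Move each centre to the mean of its group; keep it if the group is empty.
        new = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
        if new == centres:
            break
        centres = new
    return [(a + b) / 2 for a, b in zip(centres, centres[1:])]

random.seed(0)
# Toy data: three well-separated clumps around 0, 5 and 10.
data = [random.gauss(m, 0.5) for m in (0, 5, 10) for _ in range(300)]
breaks = kmeans_1d_breaks(data, k=3)
print(breaks)  # two boundaries, one in each gap between clumps
```

The returned boundaries can then be used as bin edges for the categorical variable.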

2 Answers

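Binning a variable at equally spaced percentiles yields intervals that each contain the same share of the sample, i.e. the same probability mass. The bins are therefore narrow where the density is high and wide where it is low.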
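A minimal sketch of quantile binning, using only the Python standard library (`statistics.quantiles` for the cut points and `bisect` for the assignment). The helper name `quantile_quantize` and the exponential toy data are assumptions for illustration; `pandas.qcut`, mentioned in the comments on the question, is a ready-made alternative.

```python
import bisect
import random
import statistics

def quantile_quantize(samples, k):
    """Return (levels, cuts): levels in 0..k-1 from k equal-probability bins."""
    # statistics.quantiles returns the k-1 interior cut points.
    cuts = statistics.quantiles(samples, n=k)
    # bisect_right counts how many cut points are <= x, which is the bin index.
    levels = [bisect.bisect_right(cuts, x) for x in samples]
    return levels, cuts

random.seed(0)
samples = [random.expovariate(1.0) for _ in range(1000)]  # skewed toy data
levels, cuts = quantile_quantize(samples, k=4)
counts = [levels.count(j) for j in range(4)]
print(counts)  # close to 250 samples per level
print(cuts)    # cut points are closer together where the data are dense
```

Each level receives about the same number of samples, while the bin widths adapt to the empirical density.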

user2974951

It appears that you are looking for change points. Each change point becomes a bin boundary. The density is the number of points between change points, divided by the length of the bin.

Alternatively, you can do nearest-neighbor smoothing or LOWESS smoothing.
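As a small illustration of the density computation described in this answer (number of points between boundaries divided by the bin length), the sketch below takes the bin boundaries as given; it does not detect change points. The function name `piecewise_density`, the toy samples, and the boundary at 0.5 are hypothetical. Dividing by the sample size $n$ as well is an added assumption, made so that the result integrates to one.

```python
import bisect

def piecewise_density(samples, edges):
    """Histogram density for bins defined by sorted edges (edges[0] .. edges[-1])."""
    xs = sorted(samples)
    n = len(xs)
    dens = []
    for left, right in zip(edges, edges[1:]):
        # Count the points in the half-open interval [left, right).
        count = bisect.bisect_left(xs, right) - bisect.bisect_left(xs, left)
        # Count divided by bin length, normalised by n (an added assumption).
        dens.append(count / (n * (right - left)))
    return dens

samples = [0.1, 0.2, 0.3, 0.4, 1.5, 2.5, 3.5, 4.5]
edges = [0.0, 0.5, 5.0]  # hypothetical change point at 0.5
dens = piecewise_density(samples, edges)
print(dens)  # higher density in the short first bin than in the long second bin
```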

or GP
  • The usual meaning of "change point" requires paired data. It's unclear how smoothing would address the question. – whuber Mar 30 '23 at 18:35
  • No, you can have change points in a univariate list (like a list of photon arrival times). See https://docs.astropy.org/en/stable/api/astropy.stats.bayesian_blocks.html#astropy.stats.bayesian_blocks for example. – or GP Apr 02 '23 at 05:48
  • That requires the list to be ordered and for you to create a variable corresponding to the index of the ordering. Although you could order a sample from a random variable, almost any changepoint technique would be inapplicable due to the correlations induced by that ordering. – whuber Apr 02 '23 at 12:52
  • Sure, but nothing in the original question says such an ordering is not possible. In fact, the presence of "value->" along the x-axis says that it is necessary. – or GP Apr 03 '23 at 14:32