I'm trying to think of a way to measure that a categorical distribution of any size is concentrated in only a few bins, i.e. not uniform. The best way I can think of is checking entropy, but that's hard to assess because, unless I'm missing something, it only tells me whether the distribution is close to uniform or not. I also heard about kurtosis, but that's more a metric of tailedness. For example, if I have 5 classes and 2 of them make up 80% of the distribution, I'd like a metric that reflects this.
-
You stated you wanted a metric that measures against uniformity of the bins. Under uniformity the expected size of each bin is equal, so you can measure against the observed sizes. See Chi-squared test. Whether this makes sense probably depends on the context. – statsplease Nov 13 '21 at 03:03
-
Maybe look at Wikipedia on diversity indexes, especially Simpson Index. – BruceET Nov 13 '21 at 08:59
-
What's wrong with entropy? Or with the sum of squared probabilities (or its complement, or its reciprocal) invented and re-invented by many under different names for about a century? – Nick Cox Nov 13 '21 at 10:40
-
So I am not sure if you have a misunderstanding, but statistical metrics are typically deviations from 'boring', the null hypothesis. So, e.g., you would measure deviation from uniform, rather than 'clumpiness' directly. As others have said, have a look at chi-squared tests, and see also G-tests https://en.wikipedia.org/wiki/G-test, which might make the connection to entropy clearer. – seanv507 Nov 14 '21 at 18:54
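The chi-squared approach suggested in the comments can be sketched in base R; the counts below are made up for illustration:

```r
# Hypothetical counts over 5 categories; under the uniform null each
# bin's expected count is sum(counts) / 5 = 17.
counts <- c(2, 40, 38, 2, 3)

# Goodness-of-fit test against the uniform distribution (the default
# null when the argument p is not supplied).
test <- chisq.test(counts)
test$statistic  # large when counts concentrate in a few bins
test$p.value    # small => reject uniformity
```

Here the test statistic is about 95 on 4 degrees of freedom, so the p-value is tiny and uniformity is rejected.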
4 Answers
Imagine your categorical variable (with many levels) is species (in biostatistics context). Then your question can be formulated as about how to measure biodiversity. The same question can be asked in many contexts, for instance, in economics about income inequality. So you are asking for a way of measuring diversity or inequality.
There are many such indices, for instance the Gini coefficient or the Simpson index. Below are some other posts here about diversity indices:
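As a minimal sketch (the probability vector is made up for illustration), the Simpson index and the related Gini-Simpson index can be computed directly from the category probabilities in R:

```r
# Hypothetical distribution over 5 classes, with two classes holding
# 80% of the mass (as in the question).
p <- c(0.40, 0.40, 0.10, 0.05, 0.05)

simpson      <- sum(p^2)     # P(two random draws match); high when concentrated
gini_simpson <- 1 - simpson  # Gini-Simpson index: P(two random draws differ)
effective    <- 1 / simpson  # inverse Simpson: "effective number" of categories
```

For the uniform distribution over 5 classes, `simpson` would be 0.2 and `effective` would be 5; concentration shrinks the effective number toward the count of dominant classes (here it is about 3).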
-
This is the only answer so far that answers the question. Others change the question to how to test given some null hypothesis, e.g. uniformity. – Nick Cox Dec 17 '21 at 10:35
You can use the classical occupancy test
Another test you can use here is the classical occupancy test, which uses the classical occupancy distribution (see e.g., O'Neill 2021). If the true distribution is uniform over the categories, then the number of occupied bins with $n$ balls allocated over $m$ bins follows the classical occupancy distribution. Moreover, any deviation from uniformity will tend to decrease the number of occupied bins, since it tends to concentrate balls in a smaller number of bins. Consequently, the occupancy number can be used as a test statistic, with lower values more conducive to the alternative hypothesis of non-uniformity.
Implementation: Here is some code in R to create the classical occupancy test. The test computes the p-value using the occupancy distribution in the occupancy package. The function takes in a value n for the number of balls, m for the number of bins and occupancy for the occupancy number in the data. The null hypothesis of the test is that the allocation is uniform and the alternative hypothesis is that it is non-uniform.
occupancy.test <- function(n, m, occupancy) {

  # Check inputs
  if (!is.numeric(n))         { stop("Error: Input n should be a positive integer") }
  if (length(n) != 1)         { stop("Error: Input n should be a single positive integer") }
  if (n != as.integer(n))     { stop("Error: Input n should be a positive integer") }
  if (n <= 0)                 { stop("Error: Input n should be a positive integer") }
  if (!is.numeric(m))         { stop("Error: Input m should be a positive integer") }
  if (length(m) != 1)         { stop("Error: Input m should be a single positive integer") }
  if (m != as.integer(m))     { stop("Error: Input m should be a positive integer") }
  if (m <= 0)                 { stop("Error: Input m should be a positive integer") }
  if (!is.numeric(occupancy)) { stop("Error: Input occupancy should be an integer") }
  k <- as.integer(occupancy)
  if (length(k) != 1)         { stop("Error: Input occupancy should be a single integer") }
  if (k != occupancy)         { stop("Error: Input occupancy should be an integer") }
  if (occupancy < 0)          { stop("Error: Input occupancy should be a positive integer") }
  if (occupancy > min(n, m))  { stop("Error: Input occupancy cannot be larger than n or m") }

  # Set test content
  method      <- 'Classical occupancy test'
  data.name   <- paste0('Occupancy number ', occupancy, ' from allocating ', n,
                        ' balls to ', m, ' bins')
  alternative <- 'Allocation distribution is non-uniform'
  statistic   <- k
  attr(statistic, 'names') <- 'occupancy number'

  # Lower occupancy numbers favour the alternative, so the p-value is
  # the lower tail of the occupancy distribution
  p.value <- occupancy::pocc(k, size = n, space = m)

  # Create htest object
  TEST <- list(method = method, data.name = data.name,
               null.value = NULL, alternative = alternative,
               statistic = statistic, p.value = p.value)
  class(TEST) <- 'htest'
  TEST
}
Here is an example of a classical occupancy test for $n = 18$ data points over $m = 12$ bins. In the example we generate data from a random categorical distribution using uniformity over the categories. We get a p-value of $0.8264271$, so we do not reject the null hypothesis of uniformity.
#Generate some random categorical data (for the uniform case)
set.seed(1)
DATA <- sample.int(12, size = 18, replace = TRUE)
#Compute and print occupancy number
OCC <- length(unique(DATA))
OCC
[1] 10
#Conduct the classical occupancy test
occupancy.test(n = 18, m = 12, occupancy = OCC)
Classical occupancy test
data: Occupancy number 10 from allocating 18 balls to 12 bins
occupancy number = 10, p-value = 0.8264
alternative hypothesis: Allocation distribution is non-uniform
Here is another example of a classical occupancy test for $n = 18$ data points over $m = 12$ bins. In the example we generate data from a random categorical distribution where the probabilities are concentrated on only a few bins. We get a p-value of $0.0001046$, so we reject the null hypothesis of uniformity.
#Set non-uniform bin probabilities (concentrated on only a few bins)
PROBS <- c(0.01, 0.31, 0.01, 0.01, 0.01, 0.01, 0.44, 0.16, 0.01, 0.01, 0.01, 0.01)
#Generate some random categorical data (for the non-uniform case)
set.seed(1)
DATA2 <- sample.int(12, size = 18, replace = TRUE, prob = PROBS )
#Compute and print occupancy number
OCC2 <- length(unique(DATA2))
OCC2
[1] 5
#Compute and print p-value for classical occupancy test
occupancy.test(n = 18, m = 12, occupancy = OCC2)
Classical occupancy test
data: Occupancy number 5 from allocating 18 balls to 12 bins
occupancy number = 5, p-value = 0.0001046
alternative hypothesis: Allocation distribution is non-uniform
If you want to know if several categories have significantly
different proportions of the data, then you might use prop.test in R.
(Because this test is much the same as a chi-squared test, this
is essentially a repeat of the suggestion in @epp's Comment.)
For example, if you have four categories with counts 10, 42, 20, and 15
out of 87, then you could use prop.test as below. The very small P-value tells you that there are significant differences among the
proportions 0.115, 0.483, 0.230, and 0.172.
barplot(c(10, 42, 20, 15))
prop.test(c(10, 42, 20, 15), rep(87, 4))
4-sample test for equality of proportions
without continuity correction
data: c(10, 42, 20, 15) out of rep(87, 4)
X-squared = 36.582, df = 3, p-value = 5.639e-08
alternative hypothesis: two.sided
sample estimates:
prop 1 prop 2 prop 3 prop 4
0.1149425 0.4827586 0.2298851 0.1724138
Thus, it is worth looking to see if the largest proportion is markedly larger than the next largest one.
prop.test(c(42, 20), c(62, 62))$p.val
[1] 0.0001621318
To avoid false discovery from repeated tests on the same data (ad hoc testing), it may be best to answer Yes only if the P-value is less than 5%/4 = 1.25%.
By contrast, it is not clear to me what you would think of the data 10, 32, 30, and 15. Here the proportions are clearly different, but the largest is not significantly larger than the next largest. Is there a "peak"?
prop.test(c(10, 32, 30, 15), rep(87, 4))$p.val
[1] 6.943152e-05
prop.test(c(32,30), c(62,62))$p.val
[1] 0.8574624
Sometimes the ad hoc P-value may help you decide, but not always. If you have 100 times as much data, "everything" is significant [and the bar plot (omitted) looks just the same except for the numbers on the vertical axis.]
prop.test(c(1000, 3200, 3000, 1500), rep(8700, 4))$p.val
[1] 0
prop.test(c(3200,3000), c(6200,6200))$p.val
[1] 0.0003513735
In general, I wouldn't want to use the chi-squared test statistic as a 'metric'. See @NickCox's comment below.
-
The chi-square test here is, evidently, a test of uniformity, but it's not an especially good measure. Even in its own terms a chi-square statistic can't be assessed without knowing the associated number of degrees of freedom. – Nick Cox Nov 13 '21 at 10:38
I think a metric measuring deviation from uniformity is best. Here I have generated a series of categorical distributions (the plots below show bar charts of the probabilities). For instance, the top two plots are both uniform, the first with 20 categories and the second with 3; it's possible one would want uniformity across 20 categories to count as "more uniform" than across 3. Note that I assumed all we have is the observed categories, and that any unobserved ones are ignored, which may not be realistic (e.g. when observing US states, you may want to compare the observed distribution to uniformity over all 50 states rather than just over the observed ones).
I calculated 3 different metrics:
- Kurtosis: computed on a sample of 1000 values drawn from 1:(number of classes) according to the probability distribution (e.g., from 1, 2, 3 if there are 3 categories); the probabilities are sorted from high to low, so the distribution is non-increasing. Kurtosis is insensitive to location and scale changes, which is good, but it is sensitive to the number of categories (i.e. unique values), as in the first two plots. Notice also that the first plot and the 6th have very similar kurtosis scores, even though the 6th has two high peaks. Kurtosis may also simply be inappropriate here since the data are categorical.
- Jensen-Shannon: here I calculate the JS distance (https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.jensenshannon.html) with base=2 (making it bounded in [0,1]), where the comparison distribution is uniform over the nonzero categories observed. This seems to be the best: uniform distributions have distance 0.
- Entropy: computed with base = number of nonzero observed categories; this makes it insensitive to the number of categories (e.g. any uniform distribution has entropy = 1). The 5th and 8th plots have entropies close to 1.
It seems like JS distance is the most responsive to deviation from uniformity (and it remains valid when one of the categories has probability 0 but you want to account for it).
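The last two metrics can be sketched in base R (the probability vector below is made up for illustration; the answer's own calculation used SciPy's `jensenshannon`):

```r
# Hypothetical distribution over 4 categories, concentrated on one class.
p <- c(0.7, 0.1, 0.1, 0.1)
q <- rep(1 / length(p), length(p))  # uniform reference distribution

# Normalized entropy: base = number of categories, so uniform => 1.
norm_entropy <- -sum(p * log(p, base = length(p)))

# Jensen-Shannon distance with base-2 logs, bounded in [0, 1];
# zero-probability terms contribute 0 to the KL sum by convention.
kl <- function(a, b) sum(ifelse(a > 0, a * log2(a / b), 0))
m <- (p + q) / 2
js_distance <- sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))
```

For this concentrated distribution the normalized entropy is about 0.68 (below the uniform value of 1) and the JS distance from uniform is about 0.39 (above the uniform value of 0).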