I remember something from a stats course many years ago which might be helpful now. I want to distinguish between patients who show symptoms all over the board, vs. patients who have similarly high sum-scores, but due to a few very high symptoms. Symptoms have a scale from 0 to 3.

I remember the following procedure (imagine we have 3 symptoms per patient):

  1. Calculate each symptom's share of that person's sum-score: s1/sum, s2/sum, s3/sum. Let's call these 3 values p-values.
  2. Take the natural log of these 3 values. Let's call these p(ln)-values. (If the original value is 0, the result should be set to 0; alternatively, shift the scale so that there are no 0s.)
  3. Calculate: -2*p*p(ln) per symptom.
  4. Sum these up over all symptoms.

Using this procedure, I get the same value whenever all symptoms are equally high (no matter whether they are all 1 or all 3), which is what I want. However, the difference in this value between participants with very even answering patterns and participants with only a few very high symptoms is quite small, which could be due to the narrow symptom scale of 0-3 (the differences get larger with higher values).
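
The four steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `entropy_index` and the example score vectors are made up here, and a proportion of 0 contributes 0 to the sum, as suggested in step 2. Written out this way, the quantity is twice the Shannon entropy of the proportions.

```python
import math

def entropy_index(scores):
    """Hypothetical helper: -2 * sum(p * ln p) over a patient's symptom scores."""
    total = sum(scores)
    if total == 0:
        # Assumption: a patient with no symptoms at all has no defined proportions.
        raise ValueError("all symptoms are 0, so the proportions are undefined")
    proportions = [s / total for s in scores]           # step 1
    # Steps 2-4; terms with p == 0 are skipped, i.e. treated as contributing 0.
    return -2 * sum(p * math.log(p) for p in proportions if p > 0)

# Made-up example patients with 3 symptoms each.
for scores in ([1, 1, 1], [3, 3, 3], [3, 1, 1], [3, 0, 0]):
    print(scores, entropy_index(scores))
```

The first two patients give the same value because only the proportions enter the formula, not the absolute scores.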

I am not sure whether I misremember the name "entropy" or the formula, and would appreciate help. Could I "inflate" the differences, e.g. by using 1, 10, 100, 1000 instead of 0, 1, 2, 3 as symptom values?

Torvon

1 Answer

Try looking at a diversity index (e.g. the Gini-Simpson index), which is essentially a rescaled Rényi entropy:

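$$H_q(p_1,\ldots,p_n) = \frac{1}{1-q}\log\left(p_1^q+\ldots+p_n^q\right),$$ or equivalently (again, with some rescaling) the Tsallis entropy: $$\frac{1}{q-1}\left(1 - p_1^q-\ldots-p_n^q\right).$$ Both are defined for $q \neq 1$ and reduce to the Shannon entropy $-\sum_i p_i \log p_i$ in the limit $q \to 1$.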

After normalizing the inputs so that they sum to $1$ (if they are not probabilities, you can rescale them however you like), use your favorite entropy with some parameter $q$.
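
As a rough sketch of that recipe in Python: the helper names `normalize`, `renyi_entropy` and `tsallis_entropy` are hypothetical, the Tsallis entropy is written in its usual form $(1 - \sum_i p_i^q)/(q-1)$, and the case $q = 1$ (the Shannon limit) is deliberately not handled.

```python
import math

def normalize(scores):
    """Rescale non-negative scores so that they sum to 1."""
    total = sum(scores)
    if total == 0:
        raise ValueError("scores sum to 0, cannot normalize")
    return [s / total for s in scores]

def renyi_entropy(scores, q):
    """Renyi entropy of order q (q != 1) of the normalized scores."""
    if q == 1:
        raise ValueError("q = 1 is the Shannon limit and is not handled here")
    p = normalize(scores)
    # Zero proportions are skipped so that small or negative q does not fail.
    return math.log(sum(x ** q for x in p if x > 0)) / (1 - q)

def tsallis_entropy(scores, q):
    """Tsallis entropy of order q (q != 1) of the normalized scores."""
    if q == 1:
        raise ValueError("q = 1 is the Shannon limit and is not handled here")
    p = normalize(scores)
    return (1 - sum(x ** q for x in p if x > 0)) / (q - 1)

# Made-up example: an even pattern versus a concentrated one, with q = 2.
for scores in ([2, 2, 2], [3, 1, 0]):
    print(scores, renyi_entropy(scores, 2), tsallis_entropy(scores, 2))
```

Larger $q$ puts more weight on the largest proportions, so trying a few values of $q$ is one way to see how strongly a few dominant symptoms should count.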

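Or, if you want something really simple, just use $$p_1^q+\ldots+p_n^q,$$ which for $q > 1$ always lies between 0 and 1: it equals 1 when a single symptom accounts for the whole sum, and $n^{1-q}$ when all $n$ symptoms are equal.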
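
A minimal sketch of this simple variant, assuming $q > 1$; the name `power_sum` and the default $q = 2$ are choices made for illustration only. Unlike the entropies above, a larger value here means the score is more concentrated in a few symptoms.

```python
def power_sum(scores, q=2):
    """Sum of q-th powers of the normalized scores (assumes q > 1)."""
    total = sum(scores)
    if total == 0:
        raise ValueError("scores sum to 0, cannot normalize")
    return sum((s / total) ** q for s in scores)

# Made-up example patients with 3 symptoms each.
for scores in ([1, 1, 1], [3, 3, 3], [3, 1, 1], [3, 0, 0]):
    print(scores, power_sum(scores))
```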

Piotr Migdal