
I have a set of sets and I want to assign a measure of variability to it. If all member sets are the same (e.g., S0={{a,b},{a,b},{a,b}}), the measure shall be 0. If some members are different (e.g., S1={{a,b},{a,b},{a,b,c}} and S2={{a,b},{a,b,c,d,e},{a,b,f,g,h}}), it shall be strictly greater than 0, and the magnitude shall be bigger for the sets which are intuitively more diverse (S1 is less diverse than S2).

The measure does not have to be normalized, so I am thinking of some sort of entropy, but cannot figure out how to calculate it.

DYZ
  • Variability of what quantity? – whuber Aug 19 '16 at 18:58
  • Variability of set composition. I want to know to what extent the set is homogeneous. – DYZ Aug 19 '16 at 19:01
  • Before you can assess the variability of something, you have to be able to measure it. How are you measuring "composition"? – whuber Aug 19 '16 at 19:51
  • But that's exactly my question: How to measure the variability of a set of sets? – DYZ Aug 19 '16 at 19:52
  • There is a preceding question that must be answered first: there is no "variability" in your data until you have a number. Exactly what number are you attaching to each member set? – whuber Aug 19 '16 at 19:53
  • This is not strictly true. A set of all identical sets has zero variability, no matter what the member sets elements are. I am ready to consider any measure that allows me to compare two sets of sets for being more - or less - "diverse." (Let's use the word "diversity" if "variability" is so strongly associated with numeric data.) – DYZ Aug 19 '16 at 20:04
  • As ecologists know, "diversity" can mean many things and be measured in many different ways. Why don't you explain what your statistical problem is so we can understand it? What do these sets represent and what is the idea behind "diversity"? – whuber Aug 19 '16 at 20:15
  • I gave the most general, set theoretical description of my problem. The actual nature of the set elements is irrelevant to the question. Saying that "diversity" can be measured in many different ways and not giving any examples is not helpful. – DYZ Aug 19 '16 at 20:40
  • Typically a set is by definition not allowed to contain duplicates (e.g. see here). If you associate a "count" with your set elements, this is then more like a probability measure over the set (but un-normalized). So in your case you may have something like a probability space where the sample space is the power set over some "alphabet"? – GeoMatt22 Aug 20 '16 at 23:54

1 Answer


In this answer I assume you really mean "set" rather than "multiset" (as we might see more typically in statistics).

One measure of set similarity is the Jaccard similarity $J(A,B) = {{|A \cap B|}\over{|A \cup B|}}$ (also called the Jaccard index). That is, the number of elements in both sets divided by the number of elements in at least one of them.

Correspondingly, the Jaccard dissimilarity between two sets is $1-J(A,B)$.

We could generalize the Jaccard similarity to more than two sets readily enough.

If $A$ is a collection of sets $A_1, A_2, \ldots, A_n$, then we could define $J(A) = {{|\bigcap_i A_i|}\over{|\bigcup_i A_i|}}$ and take its complement, $1-J(A)$, as a measure of dissimilarity (capturing some sense of "variability").
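As a rough illustration of the generalized measure $1-J(A)$, here is a minimal Python sketch applied to the three examples from the question. The function name `jaccard_dissimilarity` and the convention of returning 0 when every set is empty are my own assumptions, not part of any standard definition.

```python
def jaccard_dissimilarity(sets):
    """Return 1 - |intersection of all sets| / |union of all sets|."""
    sets = [set(s) for s in sets]
    if not sets:
        raise ValueError("need at least one set")
    union = set().union(*sets)
    if not union:
        # Assumption: a collection of empty sets counts as perfectly homogeneous.
        return 0.0
    intersection = set.intersection(*sets)
    return 1 - len(intersection) / len(union)


# The three collections from the question.
S0 = [{"a", "b"}, {"a", "b"}, {"a", "b"}]
S1 = [{"a", "b"}, {"a", "b"}, {"a", "b", "c"}]
S2 = [{"a", "b"}, {"a", "b", "c", "d", "e"}, {"a", "b", "f", "g", "h"}]

for name, collection in [("S0", S0), ("S1", S1), ("S2", S2)]:
    print(name, jaccard_dissimilarity(collection))
```

For these inputs the measure is 0 for S0, about 1/3 for S1 and 3/4 for S2, so it meets the ordering asked for in the question. Note that it only looks at the overall intersection and union, so it ignores how many of the member sets differ.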

(However, numerous other similarity measures exist, as whuber points out; which one is appropriate depends on what you want to measure.)


You mention using entropy (by which I assume you mean something like cross-entropy).

To work with cross entropy you'd need to assign some sort of probabilities to the elements.

If the sets were finite, and one were to define the probabilities uniformly, that might work, but

  • cross entropy is also not symmetric; you'd presumably want a symmetric measure (you could perhaps add the two cross entropies $d(p,q)=H(p,q)+H(q,p)$).

  • then you'd need to generalize to more than two sets, possibly by summing all the pairwise $d$s. However

  • I don't think this would be especially satisfactory as it stands, since it's not 0 when the sets are the same: $H(p,p)$ is just the entropy of $p$, which is positive unless the set has a single element.
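To make the points in the list above concrete, here is a small sketch of the symmetrized cross entropy $d(p,q)=H(p,q)+H(q,p)$ under the assumption that each set is given a uniform distribution over its own elements. The function names are hypothetical, and natural logarithms are used.

```python
import math


def cross_entropy_uniform(a, b):
    """H(p, q) with p uniform on set a and q uniform on set b (natural log)."""
    a, b = set(a), set(b)
    if not a or not b:
        raise ValueError("both sets must be non-empty")
    if not a <= b:
        # Some element of a has probability 0 under q, so -log q(x) is infinite.
        return math.inf
    # Every x in a has q(x) = 1/|b|, so the p-weighted average of -log q(x) is log|b|.
    return math.log(len(b))


def symmetric_cross_entropy(a, b):
    """d(p, q) = H(p, q) + H(q, p)."""
    return cross_entropy_uniform(a, b) + cross_entropy_uniform(b, a)


d_same = symmetric_cross_entropy({"a", "b"}, {"a", "b"})
d_diff = symmetric_cross_entropy({"a", "b"}, {"a", "b", "c"})
print(d_same, d_diff)
```

With identical sets `{a, b}` the result is $2\log 2$ rather than 0, which is the problem noted in the last bullet. Under this uniform assumption the result is also infinite whenever the two sets differ, because one of them then contains an element the other assigns probability 0.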


Related, but better still, might be the symmetrized Kullback-Leibler divergence, which is 0 when the two distributions are the same. Again you would need to generalize to multiple sets.
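A minimal sketch of the symmetrized divergence for two sets follows, again assuming (my assumption, not something the question specifies) that each set gets a uniform distribution over its elements. The function names are hypothetical.

```python
import math


def kl_uniform(a, b):
    """KL(p || q) with p uniform on set a and q uniform on set b (natural log)."""
    a, b = set(a), set(b)
    if not a or not b:
        raise ValueError("both sets must be non-empty")
    if not a <= b:
        # p puts mass on an element where q is 0, so the divergence is infinite.
        return math.inf
    # Each x in a contributes (1/|a|) * log((1/|a|) / (1/|b|)); summed, that is log(|b|/|a|).
    return math.log(len(b) / len(a))


def symmetrized_kl(a, b):
    """KL(p || q) + KL(q || p)."""
    return kl_uniform(a, b) + kl_uniform(b, a)


kl_same = symmetrized_kl({"a", "b"}, {"a", "b"})
kl_diff = symmetrized_kl({"a", "b"}, {"a", "b", "c"})
print(kl_same, kl_diff)
```

This is 0 for identical sets, as wanted. One caveat under the plain uniform assumption: it is infinite for any two different sets, so it cannot rank "slightly different" against "very different". Some form of smoothing of the probabilities would presumably be needed to get a graded measure.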

Hopefully these give you some ideas. You should probably look at some of the other similarity and dissimilarity indices that already exist.

Glen_b