There are many ways to compare them. Which one is most useful depends not only on what the data look like, but also on which differences you find most interesting. Your preferences may change, and if they do, the answer you find most useful may change with them.
One of the first things you need to decide is whether you care about differences in ratios or differences in absolute counts. If set C has four thousand pizzas and one thousand cakes, is that a perfect match for B, because the ratios are equal, or almost no match at all, because the total counts are so far apart? That is a matter of preference, and it can vary with how you want to look at the data. If you think ratios are what matter most, the other answer is fine, but it is far from the only option. In such cases I usually use the mutual information or the Bhattacharyya divergence, or their alternate forms, information distance and Hellinger distance. I avoid the Kullback–Leibler divergence because it is not symmetric, but that may not bother you the way it does me.
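If ratios are what you care about, one way to see how these behave is to normalize each multiset's counts into a probability distribution and compute the Bhattacharyya coefficient, and from it the Hellinger distance. Here is a minimal sketch in Python (my choice of language; the counts I give B are an assumption, chosen only so its ratios match C's):

```python
from collections import Counter
from math import sqrt

def hellinger(a: Counter, b: Counter) -> float:
    """Hellinger distance between the normalized (ratio-only) versions of a and b."""
    total_a, total_b = sum(a.values()), sum(b.values())
    # Bhattacharyya coefficient: sum over all items of sqrt(p_i * q_i)
    bc = sum(sqrt((a[k] / total_a) * (b[k] / total_b)) for k in set(a) | set(b))
    return sqrt(max(0.0, 1.0 - bc))   # max() guards against tiny floating-point overshoot

B = Counter(pizza=4, cake=1)          # assumed counts for B, with a 4:1 ratio
C = Counter(pizza=4000, cake=1000)    # the four thousand pizzas and one thousand cakes
print(hellinger(B, C))                # ~0.0: identical ratios, so a "perfect match" here
```

Because this only looks at ratios, the enormous difference in total counts is invisible to it, which is exactly the preference question above.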
If absolute counts do matter, then none of those methods is appropriate, because they all assume you are comparing probability distributions, which only describe relative ratios. Here is an easy method that does work for absolute counts: define the similarity $S=|A\cap B|/|A\cup B|$, the size (total number of objects contained, or "cardinality") of the intersection divided by the size of the union. Since we are talking about sets with multiplicity, intersection and union are a bit trickier to define than for ordinary yes-or-no sets, but it is still straightforward. In particular, let $A\cap B$ be the largest multiset that is a subset of both A and B (take the smaller of the two counts for each item), and let $A\cup B$ be the smallest multiset that contains both A and B as subsets (take the larger of the two counts). Then in your example, the intersection is {pizza: 3, cake: 1} and the union is {pizza: 4, soda: 5, cake: 2}, so the similarity is 4/11, on a scale from 0 to 1. Zero happens when there are no items in common, and one happens when the two multisets are identical.
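As a concrete sketch (in Python, again my choice), `collections.Counter` happens to implement exactly these multiset operations: `&` takes the elementwise minimum of counts and `|` takes the elementwise maximum. The two multisets below are an assumption on my part, one pair consistent with the intersection and union worked out above; your actual data may differ.

```python
from collections import Counter

def similarity(a: Counter, b: Counter) -> float:
    """|A ∩ B| / |A ∪ B| for multisets; 0 = nothing in common, 1 = identical."""
    union = a | b                      # elementwise max of counts
    if not union:
        return 1.0                     # both empty: identical by convention
    return sum((a & b).values()) / sum(union.values())

A = Counter(pizza=3, soda=5, cake=2)   # assumed multiset A
B = Counter(pizza=4, cake=1)           # assumed multiset B
print(A & B)                           # Counter({'pizza': 3, 'cake': 1})
print(A | B)                           # Counter({'soda': 5, 'pizza': 4, 'cake': 2})
print(similarity(A, B))                # 4/11 ≈ 0.364
```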
A very flexible method is to consider your multisets as column vectors of counts, and define measures as row vectors of weights to dot-product with those column vectors; comparing two sets then amounts to comparing their weighted totals. Different weight vectors give different measures; common examples are price, mass, volume, calories, grams of sugar, and so on. This leads naturally to classic optimization concepts like the knapsack problem, for instance figuring out how to obtain the largest number of calories for a fixed number of dollars. Which weight vector is best depends on the question you want your data to answer, and that is entirely up to you.
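A minimal sketch of that idea, once more in Python; the item names and weight values are made up purely for illustration:

```python
from collections import Counter

def score(counts: Counter, weights: dict) -> float:
    """Dot product of a multiset's count vector with a row vector of weights."""
    return sum(n * weights.get(item, 0.0) for item, n in counts.items())

A = Counter(pizza=3, soda=5, cake=2)                     # assumed multiset
price    = {"pizza": 12.0, "soda": 2.0, "cake": 15.0}    # dollars per item (made up)
calories = {"pizza": 2200, "soda": 150, "cake": 1800}    # kcal per item (made up)

print(score(A, price))      # total cost under the "price" weight vector
print(score(A, calories))   # total calories under a different weight vector
```

Two multisets that look similar under one weight vector (say, price) can look very different under another (say, calories), which is the sense in which the choice of weights is a choice of what you consider interesting.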