Suppose I want to compare some countries to see how similar they are with respect to the relative sizes of the industries in them. So I find data on the distribution of GDP across all industries for each country, and end up with a probability distribution over industries for each country. What distance measure should I use to compare these probability distributions?
[Edit] To clarify: I am looking for a distance measure that makes sense of our intuitive judgements about how similar countries are with respect to their industry breakdowns. For example, suppose that I have five countries (1, 2, 3, 4, 5) and four industries (A, B, C, D), and I want to compare Countries 2, 3, 4, and 5 with respect to how similar they are to Country 1.
| Country | % GDP from ind. A | % GDP from ind. B | % GDP from ind. C | % GDP from ind. D |
|---|---|---|---|---|
| Country 1 | 40% | 40% | 20% | 0% |
| Country 2 | 35% | 45% | 15% | 5% |
| Country 3 | 0% | 60% | 30% | 10% |
| Country 4 | 0% | 80% | 0% | 20% |
| Country 5 | 0% | 0% | 80% | 20% |
The intuitive ranking, from most to least similar to Country 1, is: Country 2 > Country 3 > Country 4 > Country 5.
- I think the difference between 10% and 20% for two countries with respect to some industry should probably count for the same as the difference between 80% and 90%. I can't think of a reason why it wouldn't.
- I assume a distance measure that uses ratios won't work very well given that there are zero-values?
- I don't want a measure that only looks at the largest gap between any industry's share in one country and any (possibly different) industry's share in the other, since then Country 4 and Country 5 would count as equally similar to Country 1.
- I also don't want a measure that only looks at the largest industry-by-industry difference between two countries, because then Country 3 and Country 4 would count as equally similar to Country 1.
- But I think a measure based on the largest difference over every combination (subset) of industries, as in Total Variation Distance, might be OK?
- I think intuitively the measure should be symmetric; i.e., the degree to which Country x is similar to Country y = the degree to which Country y is similar to Country x.
- I guess one of the main things I don't understand is how I should weight differences between countries for the industries:
  - Absolute difference in proportions - seems intuitive.
  - Squared difference in proportions - what is the typical rationale for this? To give small differences even less weight relative to larger differences? I don't see why I would need to do this if the differences are already small to begin with.
  - Squared difference in proportions divided by one of the proportions (like in Chi-squared distance) - what is the typical rationale for this? That the same sized difference is more significant when you're dealing with small values than when you're dealing with large values? I think maybe I don't care about that here?
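To pin down what I mean by each option, here is how I understand the definitions. The notation is my own assumption: $p_i$ and $q_i$ are the shares of industry $i$ in the two countries being compared, and $P(S)$, $Q(S)$ are the total shares of a set of industries $S$. The chi-squared form shown is only one of several variants in use.

$$
\begin{aligned}
d_{1}(p, q) &= \sum_i |p_i - q_i| && \text{(L1 norm)} \\
d_{\mathrm{TV}}(p, q) &= \max_{S} |P(S) - Q(S)| = \tfrac{1}{2} \sum_i |p_i - q_i| && \text{(Total Variation)} \\
d_{2}(p, q) &= \sqrt{\sum_i (p_i - q_i)^2} && \text{(Euclidean)} \\
d_{\infty}(p, q) &= \max_i |p_i - q_i| && \text{(maximum per-industry difference)} \\
\chi^2(p, q) &= \sum_i \frac{(p_i - q_i)^2}{q_i} && \text{(chi-squared, undefined when some } q_i = 0\text{)}
\end{aligned}
$$

If this is right, Total Variation Distance is just half the L1 norm, so the two always rank countries identically.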
Now I'm leaning towards Total Variation Distance or L1 norm. Does that sound like the right kind of approach?
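To check this against the example table, here is a minimal sketch in plain Python that computes a few candidate distances from Country 1 to each of the other countries. The function names and the `shares` dictionary are my own illustrative choices, not from any library, and the percentages are entered as proportions.

```python
import math

# Industry shares (A, B, C, D) copied from the table above, as proportions.
shares = {
    "Country 1": [0.40, 0.40, 0.20, 0.00],
    "Country 2": [0.35, 0.45, 0.15, 0.05],
    "Country 3": [0.00, 0.60, 0.30, 0.10],
    "Country 4": [0.00, 0.80, 0.00, 0.20],
    "Country 5": [0.00, 0.00, 0.80, 0.20],
}


def l1(p, q):
    """Sum of absolute per-industry differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def total_variation(p, q):
    """Total Variation Distance, which equals half the L1 distance."""
    return 0.5 * l1(p, q)


def euclidean(p, q):
    """Square root of the sum of squared per-industry differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def max_difference(p, q):
    """Largest single per-industry difference."""
    return max(abs(a - b) for a, b in zip(p, q))


reference = shares["Country 1"]
for name, dist in shares.items():
    if name == "Country 1":
        continue
    print(
        f"{name}: L1={l1(reference, dist):.2f}  "
        f"TV={total_variation(reference, dist):.2f}  "
        f"Euclidean={euclidean(reference, dist):.3f}  "
        f"max={max_difference(reference, dist):.2f}"
    )
```

On these numbers, Total Variation Distance comes out as 0.10, 0.40, 0.60, and 0.80 for Countries 2 to 5, which matches the intuitive ranking. Euclidean distance gives the same ordering here, while the maximum per-industry difference ties Country 3 and Country 4 at 0.40.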
I understand Euclidean distance, and using this for the comparisons seems intuitively plausible to me. However, Googling tells me that there are many other distance measures, and normally other ones are used to compare probability distributions. But, as someone without much background in statistics, I quickly get out of my depth when I try to understand whether and why they are suitable.
Could anyone point me in the right direction here? What do you think the most appropriate distance measure would be in this scenario, and why? (Also, if anyone can recommend any textbooks etc. that I can work through to get myself into a position where I can understand how to assess the merits of the myriad distance measures out there, and so answer a question like this for myself, please do.)