Determine correlation between two categorical columns with lots of data

Question

I have a large dataset containing country names and names of musicians like this, with more than 50,000 rows:

Country	Musician
australia	Jimmy Barnes
australia	Grinspoon
england	Giles
united states of america	Bob Dylan
united states of america	Hamlet
united states of america	Rick Astley
sweden	Judith
united states of america	The Beatles
jamaica	JPM
germany	Ruslana
russia	Ruslana
ukraine	Ruslana
united states of america	Possessed
france	Georges Brassens
greece	Jacques Brel
france	Dionysis Savvopoulos
greece	Dionysis Savvopoulos
france	Léo Ferré
greece	Léo Ferré
united states of america	Ulali
united states of america	Zozobra
colombia	Aterciopelados
colombia	Carlos Vives
colombia	Shakira
united kingdom	The Smiths
united kingdom	Morrissey

I would like to use pandas (as this data is in a dataframe) to determine if there is a correlation between the two columns, i.e. whether the country suggests which musician is named. Is this at all possible or am I completely wrong? The contingency table is 11949 rows × 190 columns if that is relevant. Thanks!

Hi, I will research those terms, as of now I don’t know what they mean. — karkraeg, Jan 12 '23 at 21:27
Can an artist appear multiple times paired with the same country? — dipetkov, Jan 12 '23 at 21:52
Following up on @dipetkov , if an artist can appear multiple times for the same country, why? (There's nothing wrong with it, but it might affect the analysis.) — Dave, Jan 12 '23 at 21:54
Hi @dipetkov and Dave, yes that sure will happen. The data is extracted from publications, so each row represents a published article or book that is about the country and the artist. Multiple publications can be about the Beatles in Spain for example. — karkraeg, Jan 13 '23 at 06:25

score 0 · Answer 1 · answered Jan 21 '23 at 14:08

There are different correlation coefficients (Pearson, Spearman, ...) and broadly they measure whether as variable X increases, variable Y tends to increase/decrease as well. This requires that for both X and Y we have a concept of "ordering". Otherwise we won't be able to say what it means for the variable to "increase" or "decrease".

You have countries and artists. There isn't an obvious way to impose an order on either, so I would say that the correlation between countries and artists is not well-defined, or at least it's not obvious how to define it.

Here is how I might consider analyzing this dataset in your place.

I can take any two countries and look at how often the same artists are mentioned (in local publications about the music industry). I would compute the similarity (or perhaps the dissimilarity) between the pair of countries, and cautiously interpret this as a measure of similarity/dissimilarity in musical preferences. Once I calculate this measure for all pairs of countries, I have a square (dis)similarity matrix that I can analyze further. For example, for each country, I can find its "closest musical neighbor". Or I can cluster the countries to visualize patterns of shared musical tastes...
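A minimal sketch of this pairwise country comparison, assuming cosine similarity as the measure (the choice of measure is left open above) and a small toy subset of rows from the question. The column names `Country` and `Musician` follow the question; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data in the same shape as the question: one row per publication.
df = pd.DataFrame({
    "Country": ["france", "greece", "france", "greece", "france", "greece",
                "germany", "russia", "ukraine", "colombia", "colombia"],
    "Musician": ["Georges Brassens", "Jacques Brel", "Dionysis Savvopoulos",
                 "Dionysis Savvopoulos", "Léo Ferré", "Léo Ferré",
                 "Ruslana", "Ruslana", "Ruslana", "Shakira", "Carlos Vives"],
})

# Country x musician table of publication counts.
counts = pd.crosstab(df["Country"], df["Musician"])

# Cosine similarity between country rows (an assumed choice of measure).
X = counts.to_numpy(dtype=float)
unit = X / np.linalg.norm(X, axis=1, keepdims=True)
sim = pd.DataFrame(unit @ unit.T, index=counts.index, columns=counts.index)

# Closest "musical neighbor": hide the diagonal, then take the row-wise maximum.
off_diag = sim.mask(np.eye(len(sim), dtype=bool))
closest = off_diag.idxmax(axis=1)
print(closest)
```

A country that shares no artists with any other (colombia in this toy data) has similarity 0 to every other country, so its "closest neighbor" is an arbitrary tie and should not be interpreted. The square matrix `sim` (or `1 - sim` as a dissimilarity) is the object one would pass on to a clustering method.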

I can do a very similar analysis of the artists by looking at the coverage they receive in different countries.
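For the artist side, a sketch under the same assumptions (cosine similarity, toy rows taken from the question) only needs the table turned around, so that each row is an artist's coverage profile across countries. `cosine_similarity_table` is a hypothetical helper written for this example, not a pandas function.

```python
import numpy as np
import pandas as pd

def cosine_similarity_table(table: pd.DataFrame) -> pd.DataFrame:
    """Cosine similarity between the rows of a count table (hypothetical helper)."""
    X = table.to_numpy(dtype=float)
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    return pd.DataFrame(unit @ unit.T, index=table.index, columns=table.index)

# Toy data: one row per publication, as in the question.
df = pd.DataFrame({
    "Country": ["germany", "russia", "ukraine", "france", "greece",
                "france", "greece", "colombia", "colombia"],
    "Musician": ["Ruslana", "Ruslana", "Ruslana", "Léo Ferré", "Léo Ferré",
                 "Dionysis Savvopoulos", "Dionysis Savvopoulos",
                 "Shakira", "Carlos Vives"],
})

# Musician x country table: each row is one artist's coverage profile.
coverage = pd.crosstab(df["Musician"], df["Country"])
artist_sim = cosine_similarity_table(coverage)
print(artist_sim.round(2))
```

With roughly 12,000 artists the full artist-by-artist matrix is large (about 12,000 × 12,000 entries), so restricting to a subset of artists first may be more practical.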

And once I have done this I would probably ask if the patterns of similarities/dissimilarities I've discovered are associated with characteristics such as: continent, national language, popular musical styles, etc. (for countries) and primary language of lyrics, number of albums, origin, preferred genre, etc. (for artists).

Thanks! I think your approach would be interesting to try. I think I would have to use normalised values to take into account that for example the USA are much more common than greece in the whole dataset. Also I assume one would only do this for the most frequent artists? — karkraeg, Jan 23 '23 at 12:42
It's true that you would have to take "size" into account. A different number of publications was analyzed for each country, I guess. Some measures of similarity don't need normalized data, e.g. cosine similarity. — dipetkov, Jan 25 '23 at 08:13
So you would have to do some exploration first, whichever methods you decide to pursue. I would suggest picking a subset of countries and a subset of artists (not randomly but based on your knowledge/intuition) and making sure that the methods produce meaningful & interpretable results. Then you apply them to the whole data. — dipetkov, Jan 25 '23 at 08:14
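To illustrate the remark about cosine similarity not needing normalized data, here is a small sketch with made-up publication counts (the numbers and variable names are hypothetical). Multiplying a country's counts by a constant, i.e. the same proportions but many more publications, leaves its cosine similarity to any other country unchanged.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two count vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical publication counts over the same four artists.
small_country = np.array([3, 1, 0, 2])
large_country = 50 * small_country      # same proportions, 50x more publications
other_country = np.array([0, 4, 5, 1])

# Same profile at a different scale: similarity is 1 up to rounding.
print(cosine(small_country, large_country))

# The comparison with a third country does not depend on the scale either.
print(cosine(small_country, other_country))
print(cosine(large_country, other_country))
```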
