I have a large dataset (more than 50,000 rows) containing country names and names of musicians, like this:

Country                   Musician
australia                 Jimmy Barnes
australia                 Grinspoon
england                   Giles
united states of america  Bob Dylan
united states of america  Hamlet
united states of america  Rick Astley
sweden                    Judith
united states of america  The Beatles
jamaica                   JPM
germany                   Ruslana
russia                    Ruslana
ukraine                   Ruslana
united states of america  Possessed
france                    Georges Brassens
greece                    Jacques Brel
france                    Dionysis Savvopoulos
greece                    Dionysis Savvopoulos
france                    Léo Ferré
greece                    Léo Ferré
united states of america  Ulali
united states of america  Zozobra
colombia                  Aterciopelados
colombia                  Carlos Vives
colombia                  Shakira
united kingdom            The Smiths
united kingdom            Morrissey

I would like to use pandas (the data is in a dataframe) to determine whether there is a correlation between the two columns, i.e. whether the country suggests which musician is named. Is this at all possible, or am I completely wrong? The contingency table is 11949 rows × 190 columns, if that is relevant. Thanks!
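For reference, here is a minimal sketch of how a contingency table like the one mentioned above can be built with `pd.crosstab`, and how one common association measure for two categorical columns (Cramér's V, derived from the Pearson chi-square statistic) could be computed from it. The toy dataframe and the variable names are assumptions for illustration only; the column names `Country` and `Musician` follow the sample. Whether this measure is appropriate for the real data is an open question: with a table as sparse as 11949 × 190, most expected counts are very small, so chi-square based p-values may be unreliable.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataframe (assumption: one row per publication).
df = pd.DataFrame({
    "Country": ["colombia", "colombia", "colombia",
                "france", "greece", "france", "greece"],
    "Musician": ["Shakira", "Carlos Vives", "Shakira",
                 "Léo Ferré", "Léo Ferré", "Georges Brassens", "Jacques Brel"],
})

# Contingency table: one row per musician, one column per country, cells are counts.
table = pd.crosstab(df["Musician"], df["Country"])

# Pearson chi-square statistic from observed vs. expected counts under independence.
observed = table.to_numpy(dtype=float)
n = observed.sum()
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
chi2 = ((observed - expected) ** 2 / expected).sum()

# Cramér's V rescales chi-square to the range [0, 1]; 0 means no association.
n_rows, n_cols = observed.shape
cramers_v = np.sqrt(chi2 / (n * (min(n_rows, n_cols) - 1)))

print(table)
print(f"chi2 = {chi2:.3f}, Cramér's V = {cramers_v:.3f}")
```

Repeated (country, musician) pairs simply increase the corresponding cell count, so duplicated rows are reflected in the table rather than dropped.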

karkraeg
  • Hi, I will research those terms, as of now I don’t know what they mean. – karkraeg Jan 12 '23 at 21:27
  • Can an artist appear multiple times paired with the same country? – dipetkov Jan 12 '23 at 21:52
  • Following up on @dipetkov: if an artist can appear multiple times for the same country, why? (There's nothing wrong with it, but it might affect the analysis.) – Dave Jan 12 '23 at 21:54
  • Hi @dipetkov and Dave, yes that sure will happen. The data is extracted from publications, so each row represents a published article or book that is about the country and the artist. Multiple publications can be about the Beatles in Spain for example. – karkraeg Jan 13 '23 at 06:25