I have a dataset of binary (0/1) data. Each Id represents a sample number, and 0 or 1 indicates whether the keyword (the columns on the left: Water, Soil, etc.) appears in the publication. The region columns on the right (e.g. Africa, Asia) indicate where the paper was published from; however, regions can overlap (e.g. the same publication can have affiliations in multiple countries).

1. What kind of statistical tool do I need to find the correlation between the regions (Europe, Africa, Asia) and the keywords (e.g. Water, Soil, Waste, etc.)?

2. What kind of statistical tool do I need to find out whether region influences the keywords?

Photo

akif
  • The keyword variables are binary, but they are not dummy variables. The region variables are dummies (they can be replaced by a single categorical variable Region). – ttnphns Sep 26 '21 at 14:37
  • It is unclear precisely which two things you want the correlation between. – ttnphns Sep 26 '21 at 14:38
  • If $x_1$ is (0, 1) and so is $x_2$ then the correlation between them is just the ... correlation between them (so long as both variables have both 0 and 1 values). Unusually, but predictably, the Pearson and Spearman correlations are identical. See also https://stats.stackexchange.com/questions/103801/is-it-meaningful-to-calculate-pearson-or-spearman-correlation-between-two-boolea – Nick Cox Sep 26 '21 at 16:00
  • I would start looking into some kind of [tag:correspondence-analysis]. Maybe you could add that tag? Please also include your data in a readable format: Hi, there are blind and visually impaired users of this site who interact with it using screen readers. The screen readers can't handle the equation in your screenshot. (https://stats.meta.stackexchange.com/a/1605/155836). – kjetil b halvorsen Sep 27 '21 at 14:57
  • Since the number of regions is not too large, you might try the chi-square test:

    Count the number of times each topic is addressed in any paper for each country. This gives a table with countries as columns and topics as rows, on which a chi-square test can be done to check for dependence.

    Another method could be to compute the mutual information between topics and countries from their joint and marginal probabilities.

    – Curious Dec 15 '23 at 14:17
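
A minimal sketch of the contingency-table idea from the last comment, in plain Python. The toy rows, the column names and the helper functions (`contingency_table`, `chi_square`, `mutual_information`) are all invented for illustration and are not from the question's data. One caveat worth stating: because a paper can carry several keywords and several regions, it is counted in more than one cell, so the cells are not independent observations and the statistic should be read as a rough descriptive measure rather than a strict test.

```python
import math

# Hypothetical toy data: one tuple per paper, invented purely for illustration.
COLUMNS = ["Water", "Soil", "Waste", "Europe", "Africa", "Asia"]
ROWS = [
    (1, 0, 1, 1, 0, 0),
    (1, 1, 0, 1, 0, 1),  # a paper affiliated with both Europe and Asia
    (0, 1, 0, 0, 1, 0),
    (0, 1, 1, 0, 1, 1),
    (1, 0, 0, 1, 0, 0),
    (0, 0, 1, 0, 0, 1),
    (1, 1, 0, 0, 1, 0),
    (1, 0, 1, 1, 1, 0),
]
papers = [dict(zip(COLUMNS, row)) for row in ROWS]
KEYWORDS = COLUMNS[:3]
REGIONS = COLUMNS[3:]


def contingency_table(rows, keywords, regions):
    """Keywords as rows, regions as columns; cell = papers with both flags set."""
    return [[sum(r[k] * r[g] for r in rows) for g in regions] for k in keywords]


def _margins(table):
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    return row_tot, col_tot, sum(row_tot)


def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom (margins must be nonzero)."""
    row_tot, col_tot, n = _margins(table)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(row_tot) - 1) * (len(col_tot) - 1)
    return stat, dof


def mutual_information(table):
    """Mutual information (in nats) between the row and column classifications."""
    row_tot, col_tot, n = _margins(table)
    mi = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:  # empty cells contribute nothing
                p = observed / n
                mi += p * math.log(p / ((row_tot[i] / n) * (col_tot[j] / n)))
    return mi


table = contingency_table(papers, KEYWORDS, REGIONS)
stat, dof = chi_square(table)
print("table:", table)
print(f"chi-square = {stat:.3f} on {dof} degrees of freedom")
print(f"mutual information = {mutual_information(table):.4f} nats")
```

If SciPy is available, `scipy.stats.chi2_contingency(table)` accepts the same kind of table and also returns a p-value and the expected counts. With counts as small as in this toy example, the usual chi-square approximation is unreliable in any case.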

1 Answer

I would start by looking into some kind of correspondence analysis. If you recode your data as a contingency table, with regions as rows and keywords as columns, you can then run a simple correspondence analysis.

The eigenvalue belonging to the first eigenvector could serve as a measure of correlation (well, really the second, as the first is always 1 and of no interest).
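
As a rough sketch of what a simple correspondence analysis computes, assuming NumPy is available: the counts below are invented, and `correspondence_analysis` is a hypothetical helper, not a library function. The code centres the table before the decomposition, which removes the trivial dimension with eigenvalue 1, so the leading singular value returned here corresponds to the "second" eigenvalue mentioned above. That singular value is the largest canonical correlation between row and column scores, and its square is the first principal inertia.

```python
import numpy as np


def correspondence_analysis(table):
    """Simple correspondence analysis of a two-way table of counts.

    Returns the singular values (canonical correlations) plus the principal
    coordinates of rows and columns. All margins must be nonzero.
    """
    counts = np.asarray(table, dtype=float)
    P = counts / counts.sum()          # correspondence matrix
    r = P.sum(axis=1)                  # row masses
    c = P.sum(axis=0)                  # column masses
    expected = np.outer(r, c)          # independence model
    # Standardized residuals; subtracting `expected` drops the trivial dimension.
    S = (P - expected) / np.sqrt(expected)
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    row_coords = (U * sv) / np.sqrt(r)[:, None]
    col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]
    return sv, row_coords, col_coords


# Hypothetical counts: regions as rows, keywords as columns.
regions = ["Europe", "Africa", "Asia"]
keywords = ["Water", "Soil", "Waste"]
table = [[4, 1, 2],
         [2, 3, 2],
         [1, 2, 2]]

sv, row_coords, col_coords = correspondence_analysis(table)
print("canonical correlations:", np.round(sv, 4))
print("principal inertias (eigenvalues):", np.round(sv ** 2, 4))
for name, xy in zip(regions, row_coords[:, :2]):
    print(name, np.round(xy, 3))
for name, xy in zip(keywords, col_coords[:, :2]):
    print(name, np.round(xy, 3))
```

The sum of the principal inertias equals the Pearson chi-square statistic of the table divided by the total count, which ties this back to the chi-square suggestion in the comments. Plotting the first two columns of `row_coords` and `col_coords` together gives the usual correspondence-analysis map.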