These are contingency tables. In your matrix m1, you have the counts associated with a null hypothesis in which the cell probabilities are all the same. That is somewhat different from the typical use of a chi-squared test on a contingency table. The default test checks whether the variables are independent: does being in a given row (column) make you more likely to land in a particular column (row) than being in a different row (column) would? That null is considerably less restrictive than yours, so we cannot use the default chi-squared setup, but we can still use a chi-squared test with a custom null.
In essence, you are after a chi-squared test for goodness of fit, with a particular null specified. Thus, you just need to ask your software for that and specify the null you want. Any software should be able to do that for you; I will demonstrate this with R.
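For concreteness, here is one way to reconstruct m1 and m2 from the tables printed further down (the original question may have built them differently):
m1 = matrix(3, nrow=3, ncol=3)                       # null pattern: all cell counts equal
m2 = matrix(c(6,0,3, 3,6,3, 0,0,6), nrow=3, ncol=3)  # the observed counts
# note: plain chisq.test(m2) would test the default independence null instead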
chisq.test(x=as.vector(m2), p=as.vector(m1)/sum(m1))
# Chi-squared test for given probabilities
#
# data: as.vector(m2)
# X-squared = 18, df = 8, p-value = 0.02123
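As a quick hand-check (assuming m1 and m2 as defined above), every expected count under the null is $27\times 1/9 = 3$, so the statistic and p-value are easy to reproduce directly:
sum((as.vector(m2) - 3)^2 / 3)        # 18, matching X-squared above
pchisq(18, df=8, lower.tail=FALSE)    # 0.02123, matching the p-value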
R complains about the above test (the chi-squared approximation may be inaccurate because every expected count is only 3), so we can check it by simulating the p-value instead of relying on the chi-squared distribution with 8 degrees of freedom being correct. There doesn't seem to be much of a problem:
set.seed(6625)
chisq.test(x=as.vector(m2), p=as.vector(m1)/sum(m1), simulate.p.value=TRUE)
# Chi-squared test for given probabilities with
# simulated p-value (based on 2000 replicates)
#
# data: as.vector(m2)
# X-squared = 18, df = NA, p-value = 0.02449
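If you want to see what the simulation is doing, here is a minimal sketch of the same Monte Carlo logic (my own reimplementation, not chisq.test's internal code): draw tables from the multinomial null, recompute the statistic each time, and see how often it reaches the observed 18.
p    = as.vector(m1)/sum(m1)               # null cell probabilities
e    = sum(m2) * p                         # expected counts (all 3 here)
stat = function(o) sum((o - e)^2 / e)      # Pearson statistic
obs  = stat(as.vector(m2))                 # 18, as above
sims = replicate(2000, stat(rmultinom(1, size=sum(m2), prob=p)))
(1 + sum(sims >= obs)) / (2000 + 1)        # simulated p-value, roughly 0.02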
The above gives you a test of the hypothesis that your observed matrix m2 comes from a population with the pattern specified in the expected matrix m1. Alternatively, if both m1 and m2 are observed matrices and you want to know whether they differ from each other, you need to use a log-linear model for multi-way contingency tables (I discuss this more thoroughly here: $\chi^2$ of multidimensional data).
# this creates the multi-way contingency table:
tab = array(NA, dim=c(3,3,2))   # 3 rows x 3 columns x 2 matrices
tab[,,1] = m1;  tab[,,2] = m2
tab = as.table(tab)
names(dimnames(tab)) = c("row", "column", "matrix")
tab
# , , matrix = A
# column
# row A B C
# A 3 3 3
# B 3 3 3
# C 3 3 3
#
# , , matrix = B
# column
# row A B C
# A 6 3 0
# B 0 6 0
# C 3 3 6
library(MASS)  # provides loglm() for fitting log-linear models
m.sat  = loglm(~row*column*matrix, tab)    # the saturated model
m.null = loglm(~matrix + row*column, tab)  # the row*column pattern is the same for both matrices
                                           # (named m.null so it doesn't overwrite the data matrix m1)
anova(m.null, m.sat)                       # nested model test of m.null vs m.sat
# LR tests for hierarchical log-linear models
#
# Model 1: ~matrix + row * column
# Model 2: ~row * column * matrix
# Deviance df Delta(Dev) Delta(df) P(> Delta(Dev)
# Model 1 15.53483 8
# Model 2 0.00000 0 15.53483 8 0.04954
# Saturated 0.00000 0 0.00000 0 1.00000
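The p-value in the last column is just the upper tail of the chi-squared distribution beyond the deviance, which you can verify directly:
pchisq(15.53483, df=8, lower.tail=FALSE)  # 0.04954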
Notice that this version is less powerful, because it must allow for sampling error in the observed m1 counts, whereas the chi-squared test above assumes those counts were specified a priori.
Your use of the word "measure" is somewhat ambiguous to me. If you are interested in a measure of effect size (i.e., how far m2 is from uniform), you can take the sample size out of the chi-squared test statistic: divide by $N$ and take the square root. That gives you the $\phi$ coefficient, $\phi = \sqrt{\chi^2/N}$.
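With the numbers from the first test above, that works out to:
sqrt(18 / sum(m2))   # phi = sqrt(chi-squared / N) = sqrt(18/27), about 0.816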