
I'm trying to calculate a correlation matrix for ordinal variables in R. The Kendall rank correlation coefficient seems a good option, as it "is a statistic used to measure the ordinal association between two measured quantities" (emphasis added).

Since my variables have different numbers of ordinal levels, I'm planning to use Stuart-Kendall Tau-c to account for ties when calculating the coefficient: "Tau-c (also called Stuart-Kendall Tau-c) is more suitable than Tau-b for the analysis of data based on non-square (i.e. rectangular) contingency tables."

The R package DescTools has a function StuartTauC which calculates "Stuart's Tau-c statistic, a measure of association for ordinal factors in a two-way table."

As an example, I will use three ordinal variables from the diamonds dataset in ggplot2:

# Import "diamonds" dataset from ggplot2
library( ggplot2 )

head( diamonds[2:4] )

# A tibble: 6 x 3
  cut       color clarity
  <ord>     <ord> <ord>
1 Ideal     E     SI2
2 Premium   E     SI1
3 Good      E     VS1
4 Premium   I     VS2
5 Good      J     SI2
6 Very Good J     VVS2

My implementation in R is as follows (I'm open to better implementations for calculating the matrix):

library( DescTools )

df <- diamonds[2:4]

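# Empty square matrix with one row and one column per variable
cor_matrix <- matrix( nrow = ncol( df ), ncol = ncol( df ) )
rownames( cor_matrix ) <- names( df )
colnames( cor_matrix ) <- names( df )

# Fill each cell with Tau-c for the corresponding pair of variables
for( row in 1:ncol( df ) ){
  for( col in 1:ncol( df ) ){
    cor_matrix[row, col] <- StuartTauC( df[[row]], df[[col]] )
  }
}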

cor_matrix

Result:

                cut       color   clarity
cut      0.89458402 -0.01356334 0.1464609
color   -0.01356334  0.97953628 0.0232527
clarity  0.14646089  0.02325270 0.9405563
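On the request for a tidier implementation, here is a minimal sketch. It assumes the statistic is symmetric and returns a single number, which the symmetric result above suggests is the case for StuartTauC with default arguments. `pairwise_matrix` is a hypothetical helper, not part of DescTools, and the toy data and the base-R Kendall statistic are stand-ins so that the example runs without extra packages.

```r
# Hypothetical helper (not part of DescTools): apply a two-argument
# statistic to every pair of columns of a data frame.
pairwise_matrix <- function(df, stat) {
  vars <- names(df)
  m <- matrix(NA_real_, nrow = length(vars), ncol = length(vars),
              dimnames = list(vars, vars))
  for (i in seq_along(vars)) {
    for (j in seq_len(i)) {
      # Assumption: stat(x, y) == stat(y, x), so each pair is computed once
      m[i, j] <- m[j, i] <- stat(df[[i]], df[[j]])
    }
  }
  m
}

# Toy ordered factors, made up for illustration only
toy <- data.frame(
  a = factor(c("lo", "mid", "hi", "hi", "lo", "mid"),
             levels = c("lo", "mid", "hi"), ordered = TRUE),
  b = factor(c("lo", "hi", "hi", "lo", "lo", "hi"),
             levels = c("lo", "hi"), ordered = TRUE)
)

# Stand-in statistic: base R's Kendall correlation on the level codes
kendall <- function(x, y) cor(as.integer(x), as.integer(y), method = "kendall")

res <- pairwise_matrix(toy, kendall)
print(res)
```

With DescTools loaded, `pairwise_matrix( diamonds[2:4], StuartTauC )` should give the same matrix as the double loop, with roughly half the calls.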

My question: shouldn't the diagonal values be 1? Or is this a feature of the Tau-c statistic (or of the function StuartTauC)?

teppo
  • Please check footnotes in https://stats.stackexchange.com/a/18136/3277 – ttnphns Sep 12 '22 at 16:48
  • I won't try to dig into the math behind the statistic, but returning a value less than 1 even when correlating a variable with itself appears to have something to do with having an unequal number of observations in each level. Consider A correlated with A and B correlated with B: A = factor(c("Low", "Low", "Low", "High", "High", "High")); B = factor(c("Low", "Low", "Low", "High", "High")); StuartTauC(A, A); StuartTauC(B, B) – Sal Mangiafico Sep 12 '22 at 16:59
  • Thank you @ttnphns! If you could add the footnote on Tau-c as an answer here (maybe with an easy-to-understand reference?), I will accept it. – teppo Sep 12 '22 at 18:15
  • Thanks also @SalMangiafico. I'll take a look at papers and textbooks describing Tau-c. – teppo Sep 12 '22 at 18:16
  • I've done some more reading. In Stuart's original paper (1953, doi:10.2307/2333101), when deriving tau-c, he concludes that "tc can sometimes attain, and for large n can generally almost attain, ±1." What still puzzles me is that the "diamonds" dataset has an n of 53940, which I consider rather large. Still, the diagonal values of the correlation matrix are quite far from 1 (especially for "cut"). – teppo Sep 15 '22 at 08:25
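A possible explanation for the diagonal, offered as a sketch rather than a checked reading of the DescTools source: assume Tau-c is computed as 2m(P - Q) / (n^2 (m - 1)), with P and Q the numbers of concordant and discordant pairs, n the sample size and m the smaller table dimension. For a variable crossed with itself, Q = 0 and P counts the pairs that fall in different levels, which gives m/(m - 1) * (1 - sum(p_i^2)), where p_i is the share of observations in level i. That reaches 1 only when all levels are equally frequent, whatever n is. `tau_c` and `self_tau_c` below are hypothetical base-R helpers written for this illustration.

```r
# Hand-rolled Tau-c from a two-way table (illustrative, assumes the
# formula 2 * m * (P - Q) / (n^2 * (m - 1)) stated above)
tau_c <- function(x, y) {
  tab <- table(x, y) * 1          # multiply by 1 to work in doubles
  n <- sum(tab)
  m <- min(dim(tab))
  r <- nrow(tab)
  k <- ncol(tab)
  conc <- 0
  disc <- 0
  for (i in seq_len(r)) {
    for (j in seq_len(k)) {
      # Cells strictly below and to the right are concordant with (i, j)
      if (i < r && j < k) {
        conc <- conc + tab[i, j] * sum(tab[(i + 1):r, (j + 1):k])
      }
      # Cells strictly below and to the left are discordant with (i, j)
      if (i < r && j > 1) {
        disc <- disc + tab[i, j] * sum(tab[(i + 1):r, 1:(j - 1)])
      }
    }
  }
  2 * m * (conc - disc) / (n^2 * (m - 1))
}

# Closed form for a variable paired with itself
self_tau_c <- function(x) {
  p <- prop.table(table(x))
  m <- length(p)
  m / (m - 1) * (1 - sum(p^2))
}

# The two factors from the comment above
A <- factor(c("Low", "Low", "Low", "High", "High", "High"))
B <- factor(c("Low", "Low", "Low", "High", "High"))

tau_c(A, A)     # 1    (balanced levels)
tau_c(B, B)     # 0.96 (3 vs 2 observations)
self_tau_c(B)   # 0.96, same value from the closed form
```

If the assumption holds, plugging the level counts of cut into the closed form gives about 0.8946, in line with the diagonal entry reported in the question. The shortfall from 1 would then reflect the unequal level frequencies and not the sample size.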
