Mutual Information for variables w/o any pairwise data

Question

While playing with an [example for a mutual information matrix][1], I noticed that I get results even for a pair of variables that have no observation in which both are populated. A minimal reproducible example:

library(infotheo)
num1 <- c(1,2,3,1,2,3,1,2,3)
num2 <- c(10,20,30,10,20,30,10,20,30)
num3 <- c(30,20,10,30,20,10,NA,NA,NA)
num4 <- c(12,21,3,0,5,7,22,3,100)
num5 <- c(NA,NA,NA,NA,NA,NA,1,2,3)
df.num <- cbind.data.frame(num1,num2,num3,num4,num5)
cor.matrix.nats <- (mutinformation(discretize(df.num, disc="equalwidth", nbins=NROW(df.num)^(1/2))))
cor.matrix.nats <- cbind.data.frame(row.names(cor.matrix.nats),cor.matrix.nats)
cor.matrix.Spearman <- cor(df.num, method="spearman", use="pairwise.complete.obs")
cor.matrix.Spearman <- cbind.data.frame(row.names(cor.matrix.Spearman), cor.matrix.Spearman)

You can see that cor produces NA for num3 vs. num5 even with use="pairwise.complete.obs", which makes sense to me. Checking what is going on in the background shows that discretize seems to assign -2147483647 to NA, i.e. it treats NA as just another value rather than as a missing value:

> unique(discretize(df.num, disc="equalwidth", nbins=5)$num5)
[1] -2147483647           1           3           5

I want to use the matrix to remove correlated variables, but in the matrix it is no longer obvious which values stem from the discrete value -2147483647:

> cor.matrix.nats$num5
[1] 0.3662041 0.3662041 0.6365142 0.3488321 1.0027183
> cor.matrix.Spearman$num5
[1] 1.0 1.0  NA 0.5 1.0

Does anyone have an idea how to identify these values so that I can replace them with e.g. 1, such that using a threshold < 1 for variable removal will leave them untouched?

[1]: Mutual Information for unordered variables

score 0 · Answer 1 · answered Jun 06 '22 at 14:56

I found a workaround. It is not entirely bulletproof, so I won't accept it as an answer, but it turns out that one can replace the specific value -2147483647, which discretize uses as the bin number for any NA, with NA, and mutinformation still works. The modified code looks like this:

library(infotheo)
num1 <- c(1,2,3,1,2,3,1,2,3)
num2 <- c(10,20,30,10,20,30,10,20,30)
num3 <- c(30,20,10,30,20,10,NA,NA,NA)
num4 <- c(12,21,3,0,5,7,22,3,100)
num5 <- c(NA,NA,NA,NA,NA,NA,1,2,3)
df.num <- cbind.data.frame(num1,num2,num3,num4,num5)
df.disc <- discretize(df.num, disc="equalwidth", nbins=5)
df.disc[df.disc==-2147483647]<-NA
cor.matrix.nats <- mutinformation(df.disc)
cor.matrix.nats <- cbind.data.frame(row.names(cor.matrix.nats),cor.matrix.nats)
cor.matrix.Spearman <- cor(df.num, method="spearman", use="pairwise.complete.obs")
cor.matrix.Spearman <- cbind.data.frame(row.names(cor.matrix.Spearman), cor.matrix.Spearman)

However, can I rely on the fact that discretize will never assign -2147483647 as a real bin number associated with data instead of with NA? Also, I'm still curious why discretize behaves like that. There is probably a good reason for it that I just can't see...
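
One way to avoid depending on the sentinel value at all, sketched here under assumptions: the missing-data pattern can be read from the original df.num instead of from the output of discretize. The helper mask.no.overlap and the stand-in matrix m.demo below are hypothetical names introduced only for illustration. The sketch uses base R only and counts, for every pair of columns, the rows in which both variables are observed; pairs with a count of zero are the ones whose matrix entry cannot be trusted.

```r
# Same example data as above.
num1 <- c(1,2,3,1,2,3,1,2,3)
num2 <- c(10,20,30,10,20,30,10,20,30)
num3 <- c(30,20,10,30,20,10,NA,NA,NA)
num4 <- c(12,21,3,0,5,7,22,3,100)
num5 <- c(NA,NA,NA,NA,NA,NA,1,2,3)
df.num <- cbind.data.frame(num1,num2,num3,num4,num5)

# Indicator matrix: 1 where a value is observed, 0 where it is NA.
obs <- !is.na(as.matrix(df.num))
storage.mode(obs) <- "double"

# n.pair[i, j] = number of rows in which columns i and j are both observed.
n.pair <- crossprod(obs)

# Pairs without a single jointly observed row (here only num3/num5).
which(n.pair == 0, arr.ind = TRUE)

# Hypothetical helper: overwrite the entries of a square matrix m
# (same column order as data) for all pairs without joint observations.
mask.no.overlap <- function(m, data, fill = 1) {
  obs <- !is.na(as.matrix(data))
  storage.mode(obs) <- "double"
  n.pair <- crossprod(obs)
  stopifnot(identical(dim(m), dim(n.pair)))
  m[n.pair == 0] <- fill
  m
}

# Stand-in for a mutual information matrix, NOT real mutinformation output.
m.demo <- matrix(0.5, nrow = 5, ncol = 5, dimnames = dimnames(n.pair))
m.masked <- mask.no.overlap(m.demo, df.num, fill = 1)
```

With the code from this answer, the helper could presumably be applied as mask.no.overlap(mutinformation(df.disc), df.num) before the row-name column is bound on, while the matrix is still purely numeric. This assumes that mutinformation keeps the variables in the column order of its input. In the same spirit, df.disc[is.na(df.num)] <- NA would mark the missing entries by position rather than by matching -2147483647.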
