While playing with an [example for a mutual information matrix][1], I realized that I get results even for a pair of variables that share not a single observation in which both are populated. A minimal reproducible example:

library(infotheo)

# num3 and num5 are never populated in the same row
num1 <- c(1,2,3,1,2,3,1,2,3)
num2 <- c(10,20,30,10,20,30,10,20,30)
num3 <- c(30,20,10,30,20,10,NA,NA,NA)
num4 <- c(12,21,3,0,5,7,22,3,100)
num5 <- c(NA,NA,NA,NA,NA,NA,1,2,3)

df.num <- cbind.data.frame(num1,num2,num3,num4,num5)

# mutual information (in nats) on equal-width bins
cor.matrix.nats <- mutinformation(discretize(df.num, disc="equalwidth", nbins=NROW(df.num)^(1/2)))
cor.matrix.nats <- cbind.data.frame(row.names(cor.matrix.nats), cor.matrix.nats)

# Spearman correlation on pairwise complete observations
cor.matrix.Spearman <- cor(df.num, method="spearman", use="pairwise.complete.obs")
cor.matrix.Spearman <- cbind.data.frame(row.names(cor.matrix.Spearman), cor.matrix.Spearman)

You can see that cor produces NA for num3 vs. num5 even with use="pairwise.complete.obs", which makes sense to me. Checking what is going on in the background shows that discretize seems to map NA to -2147483647, so NA becomes just another bin value instead of a missing value:

> unique(discretize(df.num, disc="equalwidth", nbins=5)$num5)
[1] -2147483647           1           3           5

I want to use the matrix to remove correlated variables, but in the matrix it is no longer obvious which values stem from the discrete value -2147483647:

> cor.matrix.nats$num5
[1] 0.3662041 0.3662041 0.6365142 0.3488321 1.0027183
> cor.matrix.Spearman$num5
[1] 1.0 1.0  NA 0.5 1.0

Does anyone have an idea how to identify these values so that I can replace them with, e.g., 1, such that using a threshold < 1 for variable removal will leave them untouched?
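
One possible way to flag such pairs without looking at the discretized values at all is to count, for every pair of columns, the rows in which both are populated. The following is only a sketch in base R: the names `observed`, `n.shared`, `no.overlap` and `mi` are made up for illustration, and the last step assumes that the matrix returned by mutinformation keeps the column order of df.num. The infotheo part runs only if the package is installed.

```r
# Toy data from the question.
num1 <- c(1, 2, 3, 1, 2, 3, 1, 2, 3)
num2 <- c(10, 20, 30, 10, 20, 30, 10, 20, 30)
num3 <- c(30, 20, 10, 30, 20, 10, NA, NA, NA)
num4 <- c(12, 21, 3, 0, 5, 7, 22, 3, 100)
num5 <- c(NA, NA, NA, NA, NA, NA, 1, 2, 3)
df.num <- data.frame(num1, num2, num3, num4, num5)

# 1 where a value is present, 0 where it is NA.
observed <- (!is.na(as.matrix(df.num))) * 1

# Entry [i, j] counts the rows in which columns i and j are both populated.
n.shared <- crossprod(observed)

# Pairs without a single jointly observed row.
no.overlap <- n.shared == 0
print(which(no.overlap, arr.ind = TRUE))

# Assumption: mutinformation() returns a square matrix whose rows and columns
# follow the column order of df.num, so the mask lines up by position.
if (requireNamespace("infotheo", quietly = TRUE)) {
  mi <- infotheo::mutinformation(
    infotheo::discretize(df.num, disc = "equalwidth", nbins = 3)
  )
  mi[no.overlap] <- 1  # overwrite values that rest on no shared observation
  print(mi)
}
```

For the toy data only the num3/num5 pair is flagged. The same mask could also be used to require a minimum number of shared rows, e.g. `n.shared < 3`, instead of strictly zero.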

[1]: Mutual Information for unordered variables

MarkH
  • 197

1 Answer


I found a workaround. It is not entirely bulletproof, so I won't accept it as an answer, but it turns out that one can replace the specific value -2147483647, which discretize uses as the bin number for any NA, with NA again, and mutinformation still works. The modified code looks like this:

library(infotheo)

num1 <- c(1,2,3,1,2,3,1,2,3)
num2 <- c(10,20,30,10,20,30,10,20,30)
num3 <- c(30,20,10,30,20,10,NA,NA,NA)
num4 <- c(12,21,3,0,5,7,22,3,100)
num5 <- c(NA,NA,NA,NA,NA,NA,1,2,3)

df.num <- cbind.data.frame(num1,num2,num3,num4,num5)
df.disc <- discretize(df.num, disc="equalwidth", nbins=5)
# turn the value discretize assigned to NA back into NA
df.disc[df.disc==-2147483647] <- NA

cor.matrix.nats <- mutinformation(df.disc)
cor.matrix.nats <- cbind.data.frame(row.names(cor.matrix.nats), cor.matrix.nats)

cor.matrix.Spearman <- cor(df.num, method="spearman", use="pairwise.complete.obs")
cor.matrix.Spearman <- cbind.data.frame(row.names(cor.matrix.Spearman), cor.matrix.Spearman)

However, can I rely on discretize never assigning -2147483647 as a real bin number for actual data rather than for NA? Also, I'm still curious why discretize behaves like that. There is probably a good reason for it that I just can't see...
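
One way to avoid depending on that particular number is to take the NA positions from the original data instead of searching the discretized frame for -2147483647. This is a minimal sketch, not a statement about how infotheo works internally: `na.mask` is a name made up here, and the sketch assumes that discretize returns a data frame with the same shape and column order as its input. As a side note, -2147483647 equals `-.Machine$integer.max`, the most negative non-NA value an R integer can hold, whereas the bin labels in the output above are small positive numbers.

```r
# Two of the columns from the example above are enough to show the idea.
num3 <- c(30, 20, 10, 30, 20, 10, NA, NA, NA)
num5 <- c(NA, NA, NA, NA, NA, NA, 1, 2, 3)
df.num <- data.frame(num3, num5)

# Remember where the original data were missing (logical matrix).
na.mask <- is.na(df.num)

# The value seen in the discretized output is the smallest non-NA R integer.
sentinel.matches <- (-2147483647 == -.Machine$integer.max)
print(sentinel.matches)

# Assumption: discretize() keeps the shape and column order of df.num,
# so the mask can be applied by position.
if (requireNamespace("infotheo", quietly = TRUE)) {
  df.disc <- infotheo::discretize(df.num, disc = "equalwidth", nbins = 5)
  df.disc[na.mask] <- NA  # restore NA without referring to the sentinel
  print(infotheo::mutinformation(df.disc))
}
```

Because the mask comes from `is.na(df.num)`, it stays valid even if the value used for NA were to differ between package versions.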

MarkH
  • 197