I am new to running R code in parallel, and I am trying to run the dist2list function from the metagMisc R package in parallel. By default, dist2list returns a data frame with three columns named row, col, and value. When I run it in parallel, however, I get one big list in which each row is its own item. I have been looking for ways around this.
Here is some reproducible data:
rand <- as.data.frame(as.matrix(as.dist(matrix(runif(50 * 50, -1, 1), nrow = 50, ncol = 50))))
colnames(rand) <- paste("ASV", 1:ncol(rand), sep = "_") ; row.names(rand) <- names(rand)
At this point, I would just run the dist2list command:
library(metagMisc)
rand_list <- dist2list(as.dist(rand), tri = TRUE)
Of course, my real data is not a simple 50 x 50 matrix. On my data this command takes over 20 minutes to run and returns over 420,000,000 rows, and anything I then do with the output takes an hour or more. So I need to be efficient and run this in parallel.
I took inspiration from this post: Parallelize r script. However, a simple approach like the one below does not work, because the parallel = TRUE argument is not recognized.
library(doParallel)
cores <- detectCores() - 1
cl <- makeCluster(cores)
registerDoParallel(cl)
rand_list <- system.time(dist2list(as.dist(rand), tri = TRUE), parallel = TRUE)
I came up with the rudimentary function below, and it does work, but 'results' is a large list, which is what I need to avoid. Taking inspiration from this post: Convert a list to a data frame, I used Reduce with rbind to combine 'results' into a data frame, which gets me close to what I am looking for.
func <- function(x) {
dist2list(as.dist(x), tri = TRUE)
}
system.time({
results <- mclapply(rand, func, mc.cores = 7)
df2 <- data.frame(Reduce(rbind, results))
})
However, the Reduce call sits outside mclapply, so it runs on a single core rather than the 7 I allocated with mc.cores. This step takes an enormous amount of time with my data.
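To illustrate the combining step in isolation, here is a minimal sketch on a toy list of data frames. The `pieces` list is hypothetical and only stands in for 'results'. As far as I understand, Reduce(rbind, ...) calls rbind once per list element and copies the growing result each time, whereas do.call(rbind, ...) hands every piece to a single rbind call. I have not measured whether that is fast enough at my data size.

```r
# Hypothetical stand-in for 'results': a list of small data frames
pieces <- lapply(1:3, function(i) {
  data.frame(row = paste0("ASV_", i),
             col = paste0("ASV_", i + 1),
             value = i / 10,
             stringsAsFactors = FALSE)
})

# Pairwise binding: one rbind call per element, result re-copied each time
pairwise <- Reduce(rbind, pieces)

# Single call: all pieces are passed to rbind at once
one_call <- do.call(rbind, pieces)

str(one_call)
```

Both forms still run on one core; the sketch only changes how many times rbind is called.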
Secondly, in df2 the row and col columns have lost their names: the row column now holds the row index into the rand data frame rather than the row name, and the same is true of the col column.
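To make the target format concrete, here is a sketch that builds a long table with the labels kept, using base R only. This is not the metagMisc implementation (I have not checked how dist2list works internally). The helper name `dist_to_long` is made up for illustration, and the sketch assumes the long table should list the lower triangle with columns named row, col, and value.

```r
# Toy labelled matrix (hypothetical data, same naming scheme as rand)
set.seed(1)
n <- 5
labs <- paste("ASV", 1:n, sep = "_")
m <- matrix(runif(n * n, -1, 1), nrow = n, ncol = n, dimnames = list(labs, labs))
d <- as.dist(m)

# Hypothetical helper: long format from a dist object, assuming Size >= 2
dist_to_long <- function(d) {
  n <- attr(d, "Size")
  labs <- attr(d, "Labels")
  if (is.null(labs)) labs <- as.character(seq_len(n))
  # A dist object stores the lower triangle column by column:
  # column j is paired with rows j + 1, ..., n
  col_idx <- rep(seq_len(n - 1), times = (n - 1):1)
  row_idx <- col_idx + sequence((n - 1):1)
  data.frame(row = labs[row_idx],
             col = labs[col_idx],
             value = as.vector(d),
             stringsAsFactors = FALSE)
}

long <- dist_to_long(d)
head(long)
```

Because the index vectors are built in one vectorized step, no per-row list is created and the labels come straight from the dist object.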
How can I run the Reduce step in parallel as well, so that the 'results' list is combined into a single data frame? I also need the row names and column names of rand in the output. Is there an efficient way to do this?
Better yet, is there any way to run this in parallel so that the result has the same format as the default dist2list output, as in my rand_list example?