
I have a very large dataset with mixed-type variables. When I apply Gower's dissimilarity measure to obtain the distance matrix, I run out of memory. Because the data are so large, increasing the memory limit with memory.limit(size = xxx) does not help.

So I am wondering: is there an effective approach to approximate Gower's dissimilarity measure for large datasets? Thank you!

ttnphns
Phoebe
  • What is causing you to run out of memory? Too many objects or too many variables? – ttnphns Jul 24 '23 at 18:16
  • I think it's due to the large sample size of the dataset. There are too many observations in the dataset. The number of variables is not big. – Phoebe Jul 27 '23 at 02:36
  • There is no "approximate" Gower coefficient. Just do your cluster analysis on random subsample(s) of the data. Say, of n=500 or 1000 objects. If there are clusters, they must show in a subsample of moderate size. Also, greedy methods such as hierarchical clustering are often worse on huge samples than on moderate ones (see the last point here). – ttnphns Jul 27 '23 at 07:16

1 Answer


As it stands, the main issue seems to be that we run out of memory because we compute all possible pairwise distances. The obvious options here are:

  1. use a lower triangular matrix for storage,
  2. use float16 or float32 instead of the native float64,
  3. impose sparsity, so that we only calculate distances to the $k$ nearest neighbours found using a ball tree.
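
Points 1 and 2 can be combined. The following is a minimal sketch in plain Python (not R, and not any particular package's implementation): it keeps only the entries below the diagonal, in a flat single-precision array. The names `gower_condensed` and `condensed_index` are hypothetical, and the sketch assumes equal variable weights, no missing values, and only numeric or nominal columns.

```python
from array import array


def gower_condensed(rows, is_numeric):
    """Gower dissimilarities for all pairs i > j as a flat float32 array.

    Assumptions (not from the original post): equal weights, no missing
    values, and every column is either numeric or nominal.
    """
    n, p = len(rows), len(is_numeric)
    # Range of each numeric column, used to scale differences to [0, 1].
    ranges = []
    for k in range(p):
        if is_numeric[k]:
            col = [r[k] for r in rows]
            ranges.append(max(col) - min(col))
        else:
            ranges.append(None)
    out = array("f")  # single precision: 4 bytes per stored distance
    for i in range(1, n):
        for j in range(i):  # only pairs below the diagonal
            total = 0.0
            for k in range(p):
                a, b = rows[i][k], rows[j][k]
                if is_numeric[k]:
                    # A constant column contributes 0 to every pair.
                    total += abs(a - b) / ranges[k] if ranges[k] > 0 else 0.0
                else:
                    total += 0.0 if a == b else 1.0
            out.append(total / p)
    return out


def condensed_index(i, j):
    """Position of the pair (i, j), i != j, in the flat lower-triangular layout."""
    if i < j:
        i, j = j, i
    return i * (i - 1) // 2 + j


rows = [(1.0, "a"), (3.0, "a"), (5.0, "b")]
d = gower_condensed(rows, is_numeric=[True, False])
print(d[condensed_index(0, 2)])  # distance between the first and third rows
```

This layout holds $n(n-1)/2$ values of 4 bytes each, roughly a quarter of the memory of a full float64 matrix. For $n = 100{,}000$ that is still about 20 GB, so for very large $n$ the saving alone may not be enough.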
usεr11852
  • Oh, that's a smart way to store the lower triangular matrix. But I need to compute silhouette score using the distance matrix. Is it okay to use a lower triangular one instead? Thanks! – Phoebe Jul 27 '23 at 02:35
  • Of course it is, as the distance matrix is symmetric by definition in the case of Gower's distance. (If it weren't, we would need both the upper and lower triangular parts of it.) – usεr11852 Jul 27 '23 at 10:11
  • I am using fpc::cluster.stats(dis_mat, cluster) to get the silhouette scores. If dis_mat is a lower triangular matrix, what about the cluster? I just tried this but it failed. The cluster is taken from the whole dataset for each obs. – Phoebe Jul 28 '23 at 16:24
  • Apologies, I don't know how fpc::cluster.stats does its internal calculations. As mentioned, all pairwise distances can be defined fully in a triangular matrix if the original distance matrix is symmetric. Maybe you need to reimplement some of the fpc::cluster.stats steps. – usεr11852 Jul 28 '23 at 23:18
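
As a rough illustration of what reimplementing the silhouette step could look like, here is a minimal sketch in plain Python. It is not how fpc::cluster.stats works internally. The function name `silhouette_from_condensed` is hypothetical, and the sketch assumes the distances are stored as a flat sequence holding the pairs below the diagonal, row by row: (1,0), (2,0), (2,1), (3,0), and so on. The cluster labels stay a full-length vector with one label per observation; only the distances are stored in triangular form.

```python
def silhouette_from_condensed(d, labels):
    """Per-observation silhouette widths from lower-triangular distances.

    d      : flat sequence of distances for pairs (i, j) with i > j, row by row
    labels : one cluster label per observation (at least two distinct labels)
    """
    n = len(labels)

    def dist(i, j):
        # Symmetry lets us look up (i, j) and (j, i) in the same slot.
        if i < j:
            i, j = j, i
        return d[i * (i - 1) // 2 + j]

    clusters = sorted(set(labels))
    widths = []
    for i in range(n):
        sums = {c: 0.0 for c in clusters}
        counts = {c: 0 for c in clusters}
        for j in range(n):
            if j != i:
                sums[labels[j]] += dist(i, j)
                counts[labels[j]] += 1
        own = labels[i]
        if counts[own] == 0:
            widths.append(0.0)  # singleton cluster: width set to 0 by convention
            continue
        a = sums[own] / counts[own]  # mean distance within own cluster
        b = min(sums[c] / counts[c] for c in clusters if c != own)
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return widths


# Four points on a line at 0, 1, 10, 11, forming two obvious clusters.
d = [1.0, 10.0, 9.0, 11.0, 10.0, 1.0]
print(silhouette_from_condensed(d, [0, 0, 1, 1]))
```

The average of the returned widths gives the overall silhouette score.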