Approximate Gower's dissimilarity measure

Question

I have a very large dataset with mixed-type variables. When I compute Gower's dissimilarity measure to obtain the distance matrix, R runs out of memory. Because the dataset is so large, raising the limit with memory.limit(size = xxx) does not help.

So I am wondering: is there an effective way to approximate Gower's dissimilarity measure for large datasets? Thank you!

What causes you to run out of memory? Too many objects or too many variables? – ttnphns, Jul 24 '23 at 18:16
I think it's due to the large sample size: there are too many observations in the dataset. The number of variables is not large. – Phoebe, Jul 27 '23 at 02:36
There is no "approximate" Gower coefficient. Just do your cluster analysis on random subsample(s) of the data, say of n=500 or 1000 objects. If there are clusters, they must show up in a subsample of moderate size. Also, greedy methods such as hierarchical clustering are often worse on huge samples than on moderate ones (see the last point here). – ttnphns, Jul 27 '23 at 07:16
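
The subsampling suggestion in the last comment could look roughly like the sketch below. It is written in Python with NumPy rather than R, and everything in it is an assumption made for illustration: the function names `gower_square` and `gower_on_subsample`, the split of the data into a numeric array `num` and an integer-coded categorical array `cat`, equal variable weights, and no missing values. Numeric ranges are taken from the full dataset so that dissimilarities stay on the same scale across different subsamples; that is a design choice, not something stated in the thread.

```python
import numpy as np

def gower_square(num, cat, ranges):
    """Full Gower matrix for a SMALL sample (no missing values, equal weights)."""
    scaled = np.asarray(num, dtype=np.float64) / ranges
    cat = np.asarray(cat)
    n_vars = scaled.shape[1] + cat.shape[1]
    # Numeric part: range-scaled absolute differences, summed over variables.
    d_num = np.abs(scaled[:, None, :] - scaled[None, :, :]).sum(axis=2)
    # Categorical part: 1 per variable where the categories differ.
    d_cat = (cat[:, None, :] != cat[None, :, :]).sum(axis=2)
    return (d_num + d_cat) / n_vars

def gower_on_subsample(num, cat, m=1000, seed=0):
    """Draw m rows without replacement and return (row indices, m x m Gower matrix)."""
    num = np.asarray(num, dtype=np.float64)
    cat = np.asarray(cat)
    rng = np.random.default_rng(seed)
    idx = rng.choice(num.shape[0], size=min(m, num.shape[0]), replace=False)
    # Ranges come from the full data (an assumption; see the text above).
    ranges = num.max(axis=0) - num.min(axis=0)
    ranges[ranges == 0] = 1.0  # constant columns contribute zero distance
    return idx, gower_square(num[idx], cat[idx], ranges)
```

The returned matrix is small enough to pass to any clustering routine that accepts precomputed dissimilarities. Repeating the call with different seeds gives a rough check of whether the same cluster structure shows up across subsamples.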

score 1 · Answer 1 · answered Jul 25 '23 at 00:15

As it stands, the main issue seems to be that we run out of memory because we compute all possible pairwise distances. The obvious things to try are:

use a lower triangular matrix for storage,
use float16 or float32 instead of the native float64,
impose sparsity, so that we only calculate distances to each observation's $k$ nearest neighbours, found using a ball tree.
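
The first two points can be sketched together. The following is a minimal illustration in Python with NumPy, not a reference implementation: the function name `gower_condensed` and the input layout (a numeric array `num` plus an integer-coded categorical array `cat`) are hypothetical, and it assumes equal variable weights and no missing values. It stores only one triangle of the matrix, as a flat float32 vector, and fills it one row at a time so that no n × n intermediate is ever created.

```python
import numpy as np

def gower_condensed(num, cat, dtype=np.float32):
    """Gower dissimilarities for all pairs i < j, stored as a flat vector.

    num: (n, p) numeric variables; cat: (n, q) integer-coded categories.
    Assumes no missing values and equal variable weights.
    """
    num = np.asarray(num, dtype=np.float64)
    cat = np.asarray(cat)
    n = num.shape[0]
    n_vars = num.shape[1] + cat.shape[1]
    ranges = num.max(axis=0) - num.min(axis=0)
    ranges[ranges == 0] = 1.0  # constant columns contribute zero distance
    scaled = num / ranges
    out = np.empty(n * (n - 1) // 2, dtype=dtype)  # one triangle only
    pos = 0
    for i in range(n - 1):
        # Distances from observation i to observations i+1, ..., n-1.
        d_num = np.abs(scaled[i + 1:] - scaled[i]).sum(axis=1)
        d_cat = (cat[i + 1:] != cat[i]).sum(axis=1)
        row = (d_num + d_cat) / n_vars
        out[pos:pos + row.size] = row
        pos += row.size
    return out
```

The pairs are ordered (0,1), (0,2), ..., (1,2), ..., which is the same condensed layout that `scipy.spatial.distance.squareform` understands. For n = 50,000 observations this vector holds about 1.25 billion entries, roughly 5 GB in float32, against roughly 20 GB for a full float64 square matrix. That is a fourfold saving, but the storage still grows quadratically with n.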

usεr11852

Oh, that's a smart way to store the lower triangular matrix. But I need to compute the silhouette score using the distance matrix. Is it okay to use a lower triangular one instead? Thanks! – Phoebe Jul 27 '23 at 02:35
Of course it is, as the distance matrix is symmetric by definition in the case of Gower's distance. (If it weren't, we would need both the upper and lower triangular parts of it.) – usεr11852 Jul 27 '23 at 10:11
I am using fpc::cluster.stats(dis_mat, cluster) to get the silhouette scores. If dis_mat is a lower triangular matrix, what about the cluster? I just tried this but it failed. The cluster is taken from the whole dataset for each obs. – Phoebe Jul 28 '23 at 16:24
Apologies, I don't know how fpc::cluster.stats does its internal calculations. As mentioned, all pairwise distances can be defined fully in a triangular matrix if the original distance matrix is symmetric. Maybe you need to reimplement some of the fpc::cluster.stats steps. – usεr11852 Jul 28 '23 at 23:18
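
One way to read the "reimplement some steps" suggestion is to compute silhouette widths without storing any distance matrix at all, recomputing one row of Gower dissimilarities per observation. The sketch below is in Python with NumPy and is only an illustration under stated assumptions: it does not reproduce what fpc::cluster.stats does internally, the function name `silhouette_rowwise` is hypothetical, and it assumes integer-coded categorical variables, equal weights, no missing values, and at least two clusters.

```python
import numpy as np

def silhouette_rowwise(num, cat, labels):
    """Silhouette width per observation under Gower dissimilarity, O(n) memory.

    num: (n, p) numeric variables; cat: (n, q) integer-coded categories;
    labels: (n,) cluster assignment for every observation.
    """
    num = np.asarray(num, dtype=np.float64)
    cat = np.asarray(cat)
    n = num.shape[0]
    n_vars = num.shape[1] + cat.shape[1]
    ranges = num.max(axis=0) - num.min(axis=0)
    ranges[ranges == 0] = 1.0  # constant columns contribute zero distance
    scaled = num / ranges
    uniq, codes = np.unique(np.asarray(labels), return_inverse=True)
    counts = np.bincount(codes, minlength=uniq.size)
    sil = np.zeros(n)
    for i in range(n):
        # Gower dissimilarity from observation i to every observation.
        d = (np.abs(scaled - scaled[i]).sum(axis=1)
             + (cat != cat[i]).sum(axis=1)) / n_vars
        sums = np.bincount(codes, weights=d, minlength=uniq.size)
        own = codes[i]
        if counts[own] == 1:
            continue  # singleton cluster: silhouette width is taken as 0
        a = sums[own] / (counts[own] - 1)  # mean distance within own cluster
        means = sums / counts
        means[own] = np.inf
        b = means.min()                    # nearest other cluster
        denom = max(a, b)
        sil[i] = (b - a) / denom if denom > 0 else 0.0
    return sil
```

This trades memory for time: the work is still quadratic in the number of observations, but only one row of distances exists at any moment. The mean of the returned vector is the overall average silhouette width.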
