In my field, it is very common to work with datasets on the order of a few million rows and a couple hundred features. Commonly used dimensionality reduction techniques include:
- PCA. The first thing in the book, indeed; it has a ton of drawbacks but works well if you do not have to deal with covariate shift when transferring your model to new data and you plan on using something like an SVM later. PCA + SVM with an RBF kernel is a very dirty solution which tends to produce pretty good results that no one can reproduce on new data nor interpret. On the same note, the PCA cutoff is more often than not a bit speculative. I would suggest using it at the exploratory stage: throw away the components that are supposedly wiggling around the noise threshold and look only at the first few, which are undoubtedly important (the first sketch right after this part of the list shows what I mean). Once you get a feel for the dataset, move on to something else.
- GMM and the ICA family, projection pursuit. These can help if you know you are dealing with latent variables linearly mixed together, but I do not think they apply well to categorical data, and I am not the only one. You could always split your features into continuous and categorical and deal with them separately for a start. If all your categorical data is ordinal, especially binary, you can treat it as continuous, albeit with some caveats. It may make sense to treat a variable representing, e.g., income brackets as a continuous variable with very poor resolution; this is routinely done, and it is also low-hanging fruit for the domain experts: come up with a better transformation of the categorical data into numbers and, boom, the model has improved (the second sketch after this part of the list illustrates both points).
Personally, among exploratory methods I probably prefer spectral clustering, but it does not scale very well.
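
To make the PCA point concrete, here is a minimal sketch of that exploratory-then-dirty-baseline workflow in scikit-learn. The synthetic data, the 0.95 variance cutoff and the SVM settings are stand-ins, not recommendations:

```python
# Exploratory PCA pass followed by the "dirty" PCA + RBF-SVM baseline.
# Synthetic data stands in for the real dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=5000, n_features=200, n_informative=20, random_state=0)

# Look at how fast the explained variance decays before trusting any cutoff.
cumvar = np.cumsum(PCA().fit(X).explained_variance_ratio_)
n_keep = int(np.searchsorted(cumvar, 0.95)) + 1   # speculative cutoff; eyeball the curve too
print("first components:", np.round(cumvar[:5], 3), "-> keeping", n_keep)

# Scale, project, RBF-SVM. Kernel SVMs will not scale to millions of rows,
# so in practice this runs on a subsample.
clf = make_pipeline(StandardScaler(), PCA(n_components=n_keep), SVC(kernel="rbf", gamma="scale"))
clf.fit(X, y)
```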
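
And a rough illustration of the continuous/categorical split from the GMM/ICA point. The column names and the bracket-to-number mapping are invented for the example, and that mapping is exactly the part a domain expert can improve:

```python
# Split continuous and categorical columns, run FastICA on the continuous block,
# and map an ordinal variable (income brackets) onto a crude numeric scale.
import numpy as np
import pandas as pd
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.normal(40, 10, 1000),
    "spend": rng.lognormal(3, 1, 1000),
    "income_bracket": rng.choice(["<30k", "30-60k", "60-100k", ">100k"], 1000),
})

# ICA only sees the continuous block.
sources = FastICA(n_components=2, random_state=0).fit_transform(df[["age", "spend"]].to_numpy())

# Ordinal categorical treated as continuous with very poor resolution:
# a hand-picked monotone mapping (rough bracket midpoints).
bracket_value = {"<30k": 15e3, "30-60k": 45e3, "60-100k": 80e3, ">100k": 150e3}
df["income_numeric"] = df["income_bracket"].map(bracket_value)
```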
- Supervised learning. This enables you to use a wide range of tools, such as metric-based clustering (the Mahalanobis distance works well for high-dimensional continuous data; GMMs use it under the hood) or running something like random forests over your data and looking at the feature importances afterwards (the first sketch after this list). If you are dealing with anomaly detection or otherwise important minority classes, sampling becomes a hard problem in its own right.
- (bonus) "Manually" tweaking some of the above: if you can not fit the entire dataset in memory, you could still potentially use memmaps, factorize matrices, build estimators approximating the "true values" stochastically, treating incoming samples as a stream, and so forth. A good and valuable approach, but typically reserved for when you are reasonably sure this solution works.
Feature embedding is an entire field of study, and it is heavily data-dependent. I would argue that blind data mining is largely a thing of the past, and you need to bring some domain knowledge along, after all.