2

Assume that you have $N$ clusters, and each cluster contains multiple classes. So we know the ID of every major cluster, but not the class ID of the data points inside the major clusters.

Each colour is its own class ID, e.g. red can be class ID 1 and blue can be class ID 2. Assume that you are using a Support Vector Machine (SVM) to classify the major clusters. But in this case, the SVM cannot classify each individual data point.

How should this be solved if I have $X$ points whose classes I don't know, but whose major clusters I do know?

[Image: 2-D scatter plot of several major clusters, each containing points of many different colours]

I found the major clusters with K-means clustering, and then I used an SVM with a linear kernel to find a mathematical expression for classifying each data point into the major clusters, but I have not succeeded in finding the class ID of the individual data points.

Do you have any suggestion?
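
For concreteness, here is a minimal sketch of the pipeline described above, written in Python with scikit-learn purely for illustration (all data and numbers are made up; my real code is in C):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))           # 2-D feature points
y_class = rng.integers(0, 3, size=300)  # true class IDs (colours), unknown in practice

# Step 1: find the major clusters.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
y_cluster = kmeans.labels_

# Step 2: fit a linear SVM that reproduces the cluster boundaries.
svm = SVC(kernel="linear").fit(X, y_cluster)

# The SVM now predicts CLUSTER membership for a new point...
x_new = np.array([[0.5, -0.2]])
print("predicted cluster:", svm.predict(x_new))

# ...but nothing in this pipeline ever sees y_class, so it cannot
# predict the colour of a point, which is exactly my problem.
```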

euraad
  • 415
  • If it's correct that the information being clustered into major clusters is categorical, discrete or nominal, which it sounds like it is, then k-means is NOT the appropriate algorithm. K-means is intended for use with continuously scaled information, ideally at the interval scale level. A more appropriate routine for categorical information or mixtures of scales is latent class clustering as originally developed by academics like Clifford Clogg, James Coleman, William Dillon and James Heckman. If the data is massive, then see Dunson's recent work on Bayesian Pyramids. – user78229 Nov 02 '23 at 11:54

2 Answers

4

The trouble with your example is that these clusters are not particularly informative about the color you aim to predict. If one cluster were mostly blue, another mostly yellow, etc., then the cluster (or the location in 2D space) would be informative about color/category. If your real work looks like the posted example, you probably do not have adequate data to make accurate predictions.

Distinct outcomes need to differ in the features in order to be distinguished from each other. Your example lacks this characteristic.

If there is a third feature (interpret it as sticking in and out of the image) and the colors in each cluster correspond to how far out of the image the points stick (which might not be the same for each color in each cluster), then knowing both the cluster (or just the $x$-$y$ coordinates) and the height might allow you to improve performance.
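
As a small illustration of that idea, here is a synthetic sketch (Python/scikit-learn; all data and numbers are invented) where the two original features alone are uninformative about the color, but an added "height" feature tied to the color makes a linear SVM work:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n = 600
color = rng.integers(0, 3, size=n)              # class ID per point
centers = rng.normal(scale=5, size=(4, 2))      # four cluster centers
xy = centers[rng.integers(0, 4, size=n)] + rng.normal(size=(n, 2))
height = color + rng.normal(scale=0.3, size=n)  # third feature, tied to class

acc_2d = cross_val_score(SVC(kernel="linear"), xy, color, cv=5).mean()
acc_3d = cross_val_score(SVC(kernel="linear"),
                         np.column_stack([xy, height]), color, cv=5).mean()
print(f"accuracy with x,y only:   {acc_2d:.2f}")  # roughly chance (~1/3)
print(f"accuracy with x,y,height: {acc_3d:.2f}")  # close to 1.0
```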

Dave
  • 62,186
  • The last two images in my answer here seem related. – Dave Nov 01 '23 at 17:08
  • Thank you so much. Do you think this problem is impossible to solve? Do you recommend, for example, UMAP or t-SNE for reducing the dimension before linear SVM classification? – euraad Nov 01 '23 at 17:50
  • @euraad If nothing in the feature space differentiates one category from another, then there isn't much you can do. UMAP and t-SNE are nice tools for reducing the dimension to something that can be visualized, but if you just have two features like you have in your example, you can plot the entirety of the data set and examine, visually, if there is anything that distinguishes one color from any others. Your example seems to lack that. Perhaps your real data do not. // My answer at the link I gave is worth reading. This question is, arguably, a duplicate in disguise. – Dave Nov 01 '23 at 17:53
  • Thank you so much. I will think a while which method I could use. My data is just typical image classification with object detection. – euraad Nov 01 '23 at 17:54
  • Then you have way more features than just two. Why not use those features instead of the cluster membership? Clusters don't have to relate to the outcome categories of interest. The outcome categories of interest do not even make an appearance in the clustering, and the clustering will be the same no matter what outcome you aim to predict. – Dave Nov 01 '23 at 17:58
  • Well, the image data comes from interest points, e.g. corners or edges. One edge can be found in a lot of different images. The edge can be described as a binary feature, e.g. 0b110101010111. So my goal is to classify that type of data (see the sketch after these comments). Here is a simple 5-minute tutorial on how to classify images with object detection. It is very simple with K-NN, but very CPU expensive. https://www.youtube.com/watch?v=25GkgxClSaU&ab_channel=CyrillStachniss – euraad Nov 01 '23 at 18:04
  • I want to use SVM instead of K-nn. – euraad Nov 01 '23 at 18:05
  • What keeps you from running SVM? I don't follow why the clustering is useful. – Dave Nov 01 '23 at 18:06
  • I'm writing C code from scratch, real ANSI C (C89), and I have built a 1-layer linear neural network with SVM. So when I'm using SVM, I'm often using multi-class SVM, sometimes with a kernel, depending on the data. – euraad Nov 01 '23 at 18:08
  • Most people use Python, R or Matlab for machine learning... but I'm using C. – euraad Nov 01 '23 at 18:08
  • That doesn't explain to me why you want to use this clustering in your pipeline. However, even if you have a reason, the fact remains that, if your clusters are highly diverse in the outcome categories they contain, then they are not informative of the category to which a new point in that cluster will belong. You need something to differentiate the yellows from the pinks from the greens. – Dave Nov 01 '23 at 18:11
  • The reason I'm using SVM is that SVM gives me a model back, multi-class or single class, but K-NN requires that I save the data so I can compare the unknown data against the saved data, e.g. with the L2 norm or Hamming distance in this case. – euraad Nov 01 '23 at 18:14
  • 1
    I understand that I need to find more dimensions that can separate the data from each other. – euraad Nov 01 '23 at 18:15
  • But why are you clustering at all? – Dave Nov 01 '23 at 18:16
  • I'm clustering because my data belongs to different classes. The ideal would be for one cluster to be one specific class. But that's a dream. – euraad Nov 01 '23 at 18:18
  • I recommend posting a new question about using clustering in classification. – Dave Nov 01 '23 at 18:19
  • Thank you. I will do that. – euraad Nov 01 '23 at 18:26
  • By the way! I solved the issue... K-means clustering -> linear SVM, two times. Done. – euraad Nov 03 '23 at 23:20
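
(Sketch referenced in the comments above: a minimal, purely illustrative Hamming-distance K-NN on binary descriptors, with random bits standing in for real ORB-style features.)

```python
import numpy as np

rng = np.random.default_rng(2)
train = rng.integers(0, 2, size=(100, 32), dtype=np.uint8)  # stored descriptors
labels = rng.integers(0, 5, size=100)                       # their class IDs
query = rng.integers(0, 2, size=32, dtype=np.uint8)         # unknown descriptor

# Hamming distance = number of differing bits = sum of XOR.
dists = np.count_nonzero(train ^ query, axis=1)

k = 3
nearest = labels[np.argsort(dists)[:k]]
pred = np.bincount(nearest).argmax()  # majority vote among the k nearest
print("predicted class:", pred)

# Note: K-NN must keep `train` and `labels` around at prediction time,
# whereas an SVM hands back a compact model instead.
```
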
4

This problem violates the so-called cluster hypothesis, which states that points in the same cluster should generally belong to the same class. Here the clustering appears uninformative for determining the actual class of each individual point: it seems to capture nothing useful about the classification under study, so it gives us no useful measure of similarity between samples that actually belong to the same class. A clustering is useful when it groups items according to similarities you actually care about; this one does not group items of the same class together, so it is not useful in the context of this classification. There is a fundamental disconnect between the clustering and the classes: one is not useful for predicting the other.
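
One way to make this concrete is to check how much the cluster labels actually tell you about the class labels, e.g. with a contingency table or a mutual-information score. A minimal sketch (Python/scikit-learn, with made-up labels standing in for real assignments):

```python
import numpy as np
from sklearn.metrics import adjusted_mutual_info_score

rng = np.random.default_rng(3)
y_cluster = rng.integers(0, 4, size=500)  # cluster assignments
y_class = rng.integers(0, 7, size=500)    # class IDs, independent of cluster here

# Contingency table: rows = clusters, columns = classes. A useful clustering
# gives rows dominated by one class; a near-uniform table means the clusters
# carry essentially no information about the classes.
table = np.zeros((4, 7), dtype=int)
np.add.at(table, (y_cluster, y_class), 1)
print(table)

# Near 0 for unrelated labelings, near 1 for a clustering that matches classes.
print("AMI:", adjusted_mutual_info_score(y_class, y_cluster))
```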

  • So you are saying that it is not possible to make predictions if the clusters share the same classes? – euraad Nov 01 '23 at 17:38
  • 1
    @euraad It doesn't look like it, although there's no hard cutoff between "possible" and "not possible", just a decreasing ability to get a good result. The data you show looks very close to "not possible", as there is virtually no association whatsoever between the clustering and the classes you actually care about. It'd be like grouping people by the first letter of their last name, and trying to predict their favorite genre of movie - there just isn't any useful relationship between the two to leverage. – Nuclear Hoagie Nov 01 '23 at 17:45
  • 1
    @euraad You can always make predictions, but what you've shown in your example makes it seem like those predictions cannot consistently be accurate. For instance, if you know that a new point belongs to that cluster on the right, doesn't it seem like there is a one-in-seven chance that it belongs to each color group? That seems to be about all you can say. – Dave Nov 01 '23 at 17:48
  • I understand! Well, perhaps I need to add some dimension reduction. If you wonder what the data is, it's binary data, e.g. 0b110100101001001. I want to classify this. The data comes from ORB or SIFT descriptors for images. The most common tool is to use K-NN with sum(XOR(class image data, unknown image data)). If the sum is small, then the image belongs to a specific class. – euraad Nov 01 '23 at 17:52
  • 1
    @euraad You could possibly do very slightly better than random guessing by recognizing that being in the top two clusters gives a slightly increased probability of being in the brown class, while being in the bottom right cluster implies that it's not the yellow class. But with almost all classes observed in all clusters, even if the performance is "better than random", it's still likely in the range of "not useful". – Nuclear Hoagie Nov 01 '23 at 17:53
  • To be honest, I could classify with K-means + linear SVM and then make a histogram of how often e.g. cluster 1, cluster 2, cluster 3, ..., cluster N get points. Then I would use K-NN to classify the histogram (see the sketch below). But the problem with that method is that it is very sensitive to noise. – euraad Nov 01 '23 at 17:56
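
(Sketch referenced in the last comment: a bag-of-visual-words-style pipeline. For brevity it assigns descriptors with `kmeans.predict` directly, standing in for the linear SVM stage; all data is synthetic and illustrative.)

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(4)
n_clusters = 8

def image_histogram(descriptors, kmeans):
    """Histogram of how often each cluster 'gets' this image's descriptors."""
    hits = kmeans.predict(descriptors)
    return np.bincount(hits, minlength=n_clusters) / len(hits)

# Descriptors pooled from many training images, used to learn the vocabulary.
all_desc = rng.normal(size=(1000, 16))
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(all_desc)

# Per-image histograms and (made-up) image-level class labels.
hists = np.array([image_histogram(rng.normal(size=(50, 16)), kmeans)
                  for _ in range(40)])
img_labels = rng.integers(0, 3, size=40)

# Classify the histograms with K-NN, as described in the comment above.
knn = KNeighborsClassifier(n_neighbors=3).fit(hists, img_labels)
print("predicted image class:", knn.predict(hists[:1]))
```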