Clustering a correlation matrix

Question

I have a correlation matrix which states how every item is correlated to the other item. Hence for a N items, I already have a N*N correlation matrix. Using this correlation matrix how do I cluster the N items in M bins so that I can say that the Nk Items in the kth bin behave the same. Kindly help me out. All item values are categorical.

Thanks. Do let me know if you need any more information. I need a solution in Python but any help in pushing me towards the requirements will be a big help.

I dont need a hierarchical clustering for my problem. Just need to tell which items behave likewise. — Abhishek093, Feb 19 '15 at 11:49
FYI, this problem is called bi-clustering. A demo of it can be found at http://scikit-learn.org/stable/auto_examples/bicluster/plot_spectral_coclustering.html — chanp, Feb 18 '16 at 14:20

Rodin · Accepted Answer · 2018-10-22T09:47:53.120

Looks like a job for block modeling. Google for "block modeling" and the first few hits are helpful.

Say we have a covariance matrix where N=100 and there are actually 5 clusters: Initial covariance matrix

What block modelling is trying to do is find an ordering of the rows, so that the clusters become apparent as 'blocks': Optimized covariance matrix order

Below is a code example that performs a basic greedy search to accomplish this. It's probably too slow for your 250-300 variables, but it's a start. See if you can follow along with the comments:

import numpy as np
from matplotlib import pyplot as plt

# This generates 100 variables that could possibly be assigned to 5 clusters
n_variables = 100
n_clusters = 5
n_samples = 1000

# To keep this example simple, each cluster will have a fixed size
cluster_size = n_variables // n_clusters

# Assign each variable to a cluster
belongs_to_cluster = np.repeat(range(n_clusters), cluster_size)
np.random.shuffle(belongs_to_cluster)

# This latent data is used to make variables that belong
# to the same cluster correlated.
latent = np.random.randn(n_clusters, n_samples)

variables = []
for i in range(n_variables):
    variables.append(
        np.random.randn(n_samples) + latent[belongs_to_cluster[i], :]
    )

variables = np.array(variables)

C = np.cov(variables)

def score(C):
    '''
    Function to assign a score to an ordered covariance matrix.
    High correlations within a cluster improve the score.
    High correlations between clusters decease the score.
    '''
    score = 0
    for cluster in range(n_clusters):
        inside_cluster = np.arange(cluster_size) + cluster * cluster_size
        outside_cluster = np.setdiff1d(range(n_variables), inside_cluster)

        # Belonging to the same cluster
        score += np.sum(C[inside_cluster, :][:, inside_cluster])

        # Belonging to different clusters
        score -= np.sum(C[inside_cluster, :][:, outside_cluster])
        score -= np.sum(C[outside_cluster, :][:, inside_cluster])

    return score


initial_C = C
initial_score = score(C)
initial_ordering = np.arange(n_variables)

plt.figure()
plt.imshow(C, interpolation='nearest')
plt.title('Initial C')
print 'Initial ordering:', initial_ordering
print 'Initial covariance matrix score:', initial_score

# Pretty dumb greedy optimization algorithm that continuously
# swaps rows to improve the score
def swap_rows(C, var1, var2):
    '''
    Function to swap two rows in a covariance matrix,
    updating the appropriate columns as well.
    '''
    D = C.copy()
    D[var2, :] = C[var1, :]
    D[var1, :] = C[var2, :]

    E = D.copy()
    E[:, var2] = D[:, var1]
    E[:, var1] = D[:, var2]

    return E

current_C = C
current_ordering = initial_ordering
current_score = initial_score

max_iter = 1000
for i in range(max_iter):
    # Find the best row swap to make
    best_C = current_C
    best_ordering = current_ordering
    best_score = current_score
    for row1 in range(n_variables):
        for row2 in range(n_variables):
            if row1 == row2:
                continue
            option_ordering = best_ordering.copy()
            option_ordering[row1] = best_ordering[row2]
            option_ordering[row2] = best_ordering[row1]
            option_C = swap_rows(best_C, row1, row2)
            option_score = score(option_C)

            if option_score > best_score:
                best_C = option_C
                best_ordering = option_ordering
                best_score = option_score

    if best_score > current_score:
        # Perform the best row swap
        current_C = best_C
        current_ordering = best_ordering
        current_score = best_score
    else:
        # No row swap found that improves the solution, we're done
        break

# Output the result
plt.figure()
plt.imshow(current_C, interpolation='nearest')
plt.title('Best C')
print 'Best ordering:', current_ordering
print 'Best score:', current_score
print
print 'Cluster     [variables assigned to this cluster]'
print '------------------------------------------------'
for cluster in range(n_clusters):
    print 'Cluster %02d  %s' % (cluster + 1, current_ordering[cluster*cluster_size:(cluster+1)*cluster_size])

Isnt that technique used for Social Networks clustering? Will that be relevant here? Does it make sense to use that correlation matrix as a distance matrix? — Abhishek093, Feb 19 '15 at 11:45
Okay. I saw through the first few links. I still dont know how this will help me solve my problem. — Abhishek093, Feb 19 '15 at 11:52
I am gonna check it out now. I will let you know if it fit my problem. Thank you so much. — Abhishek093, Feb 20 '15 at 06:52
Is there a way to improve this code to handle the situation when the sizes of clusters are not known in advance? — Sergei Tikhomirov, May 31 '18 at 17:02
I assume it's version depended, but I got a IndexError: arrays used as indices must be of integer (or boolean) type, since the indices used when calculating the score are floats by default. This can be fixed by converting inside_cluster and outside_cluster to int. — Mitchell van Zuylen, Oct 19 '18 at 09:36

score 8 · Answer 2 · edited May 16 '20 at 13:00

Have you looked at hierarchical clustering? It can work with similarities, not only distances. You can cut the dendrogram at a height where it splits into k clusters, but usually it is better to visually inspect the dendrogram and decide on a height to cut.

Hierarchical clustering is also often used to produce a clever reordering for a similarity matrix visualization as seen in the other answer: it places more similar entries next to each other. This can serve as a validation tool for the user, too!

score 2 · Answer 3 · answered Sep 14 '16 at 07:41

2

Have you looked into correlation clustering? This clustering algorithm uses the pair-wise positive/negative correlation information to automatically propose the optimal number of clusters with a well defined functional and a rigorous generative probabilistic interpretation.

answered Sep 14 '16 at 07:41

Shai

298

The promoted Wikipedia article: Correlation clustering provides a method for clustering a set of objects into the optimum number of clusters without specifying that number in advance. Is that a definition of the method? If yes it is strange because there exist other methods to automatically suggest the number of clusters, and also, why then is it called "correlation". – ttnphns Sep 14 '16 at 08:44
@ttnphns (1) it is called "correlation clustering" because it expects as input a pair-wise correlation matrix (see the seminal work of Bansal, N.; Blum, A.; Chawla, S. (2004). "Correlation Clustering". Machine Learning. 56: 89). – Shai Sep 14 '16 at 08:53
@ttnphns regarding the "optimal number of clusters": you are right about the fact that "optimal" is ambiguous, "optimal" under what measure? As for correlation clustering, if you accept the generative model proposed in Bagon & Galun "Large Scale Correlation Clustering", then the method does outputs the optimal number. – Shai Sep 14 '16 at 08:55
Shai, it appears that you are one of the inventors of the method. I would encourage you to give a more unwrapped answer presenting it - if you have time and desire. Specifically, one wants to know how the method is placed among some well established ones, such as k-means or hierarhical. Note also correlation is easily convertible to euclidean distance (with any standard clustering method applicable afterwards), - knowing that fact/trick, what things then does your method allows which that "trick" allows not? Write about it. (Thanks in advance!) – ttnphns Sep 14 '16 at 09:18
@ttnphns I'm afraid this falls under "too broad": the theoretic part of the paper I cited here covers your questions. – Shai Sep 14 '16 at 09:45
1

I hope it covers. I just wanted to say that it is always a good idea to give a bit more details in an answer posted on this site, especially when a method is rather new and when one knows what to say, being an inventor. :-) No, isn't "too broad". – ttnphns Sep 14 '16 at 09:50

score 1 · Answer 4 · answered Aug 07 '17 at 12:48

1

I would filter at some meaningful (statistical significance) threshold and then use the dulmage-mendelsohn decomposition to get the connected components. Maybe before you can try to remove some problem like transitive correlations (A highly correlated to B, B to C, C to D, so there is a component containing all of them, but in fact D to A is low). you can use some betweenness based algorithm. It's not a biclustering problem as someone suggested, as the correlation matrix is symmetrical and therefore there is no bi-something.

answered Aug 07 '17 at 12:48

user2843263

21

This answer does not quite explain how to set the suggested thresholds, which IMO seems arbitrary. Additionally, as this question is two years old, and an answer with a couple of upvotes has already been accepted, you might want to elaborate on the already existing information. – IWS Aug 07 '17 at 14:57
+1 I think the name of the algorithm is the very useful piece of information in this answer. – Roger V. Mar 24 '21 at 12:45

Clustering a correlation matrix

4 Answers4