28

I want to create heatmaps based upon cosine dissimilarity.

I'm using R and have explored several packages, but cannot find a function to generate a standard cosine dissimilarity matrix. The built-in dist() function doesn't support cosine distances. The arules package has a dissimilarity() function, but it only works on binary data.

Can anybody recommend a library, or demonstrate how to calculate cosine dissimilarity within R?

Brad
  • 600
Greg Slodkowicz
  • 435

5 Answers

36

As @Max indicated in the comments (+1) it would be simpler to "write your own" than to spend time looking for it somewhere else. As we know, the cosine similarity between two vectors $A,B$ of length $n$ is

$$ C = \frac{ \sum \limits_{i=1}^{n}A_{i} B_{i} }{ \sqrt{\sum \limits_{i=1}^{n} A_{i}^2} \cdot \sqrt{\sum \limits_{i=1}^{n} B_{i}^2} } $$

which is straightforward to compute in R. Let X be the matrix whose rows are the vectors we want to compare. Then we can compute the similarity matrix with the following R code:

cos.sim <- function(ix) 
{
    A = X[ix[1],]
    B = X[ix[2],]
    return( sum(A*B)/sqrt(sum(A^2)*sum(B^2)) )
}   
n <- nrow(X) 
cmb <- expand.grid(i=1:n, j=1:n) 
C <- matrix(apply(cmb,1,cos.sim),n,n)

Then the matrix C is the cosine similarity matrix and you can pass it to whatever heatmap function you like (the only one I'm familiar with is image()).
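If the goal is a distance for clustering rather than a plot of C itself, the similarity can be turned into a dissimilarity with 1 - C and wrapped in as.dist(). A minimal sketch, assuming the observations are the rows of a numeric matrix with no all-zero rows; the name cos.dist and the example data are made up for illustration:

```r
# Hypothetical helper: cosine dissimilarity between the rows of a numeric matrix
cos.dist <- function(m) {
    m <- as.matrix(m)
    norms <- sqrt(rowSums(m^2))                  # Euclidean length of each row
    sim <- tcrossprod(m) / outer(norms, norms)   # pairwise cosine similarities
    as.dist(1 - sim)                             # "dist" object, as hclust() expects
}

set.seed(1)
X <- matrix(rnorm(40), nrow = 8, ncol = 5)       # illustrative data only
hc <- hclust(cos.dist(X))                        # cluster the rows of X
```

Because heatmap() calls its distfun argument on the matrix for the rows and on its transpose for the columns, heatmap(X, distfun = cos.dist) should use this measure for both dendrograms.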

Macro
  • 44,826
  • Thanks, this is helpful. Actually, I don't want to plot the matrix itself but rather have a distance function for clustering of another heatmap that I have. – Greg Slodkowicz Jul 05 '12 at 12:01
  • @GregSlodkowicz, OK well perhaps you can pass this matrix to the function you're using. Also, if you've found this answer helpful please consider an upvote (or accepting the answer if you consider it definitive) :) – Macro Jul 05 '12 at 12:26
  • Great, thanks to your reply and ttnphns's comment I was able to do what I want. Now I would like to have a different metric when clustering rows than when clustering columns but maybe that's pushing it... – Greg Slodkowicz Jul 07 '12 at 10:20
  • Apparently I don't have enough points to comment. I just wanted to offer a slightly modified version of Macro's nice answer. I don't like global variables, which is why I included S as a parameter. Here it is.

    # ChirazB's version of cos.sim() by Macro, where S = X %*% t(X)
    cos.sim.2 <- function(S, ix) {
        i <- ix[1]
        j <- ix[2]
        return( S[i,j] / sqrt(S[i,i] * S[j,j]) )
    }

    # test
    X <- matrix(rnorm(20), nrow=5, ncol=4)
    S <- X %*% t(X)
    n <- nrow(X)
    idx.arr <- expand.grid(i=1:n, j=1:n)
    C  <- matrix(apply(idx.arr, 1, cos.sim), n, n)   # Macro's cos.sim() reads the global X
    C2 <- matrix(apply(idx.arr, 1, cos.sim.2, S), n, n)

    – Chiraz BenAbdelkader Jan 11 '15 at 17:23
  • 1
    it should be sqrt(sum(A^2))*sqrt(sum(B^2)) instead of sqrt(sum(A^2)*sum(B^2)) – Marcin Aug 25 '21 at 09:02
  • Why? Does it make any difference? – WJH Jun 12 '23 at 12:47
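On the question of whether the two forms differ: algebraically they are identical, since sqrt(a*b) = sqrt(a)*sqrt(b) for non-negative a and b. In floating point they can differ only at extreme magnitudes, where the product of the two sums of squares overflows before the square root is taken. A small sketch with deliberately huge, made-up values:

```r
A <- c(1e100, 1e100)
B <- c(1e100, 1e100)

# The product of the sums of squares is 2e200 * 2e200, which overflows to Inf
one.sqrt <- sum(A * B) / sqrt(sum(A^2) * sum(B^2))

# Taking each square root first keeps the intermediate values finite
two.sqrt <- sum(A * B) / (sqrt(sum(A^2)) * sqrt(sum(B^2)))

one.sqrt   # 0, because the denominator became Inf
two.sqrt   # 1, up to rounding
```

For data of ordinary magnitude the two expressions give the same result.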
30

Many answers here are computationally inefficient; try this instead.

For the cosine similarity matrix:

Matrix <- as.matrix(DF)
sim <- Matrix / sqrt(rowSums(Matrix * Matrix))
sim <- sim %*% t(sim)

Convert it to a cosine dissimilarity matrix (distance matrix):

D_sim <- as.dist(1 - sim)
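To see what these lines produce, here is a small check on a toy matrix whose cosines are easy to verify by hand. The matrix and its values are illustrative assumptions, not data from the question:

```r
# Toy data: three row vectors in the plane
Matrix <- rbind(a = c(1, 0), b = c(0, 1), c = c(1, 1))

# Dividing by the row norms scales every row to unit length
# (the norm vector has one entry per row, so it recycles down the columns)
unit <- Matrix / sqrt(rowSums(Matrix * Matrix))
sim <- unit %*% t(unit)

round(sim, 4)
#        a      b      c
# a 1.0000 0.0000 0.7071
# b 0.0000 1.0000 0.7071
# c 0.7071 0.7071 1.0000
```

A row of all zeros has norm 0, so its entries would come out as NaN; such rows need to be dropped or handled separately before clustering.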
Brad
  • 600
11

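You can use the cosine() function from the lsa package. Given a matrix, it computes the similarities between the columns, so transpose first if your observations are in rows: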
http://cran.r-project.org/web/packages/lsa

vonjd
  • 6,146
6

The following function might be useful when working with matrices, instead of 1-d vectors:

# input: row matrices 'ma' and 'mb' (with compatible dimensions)
# output: cosine similarity matrix

cos.sim=function(ma, mb){
  mat=tcrossprod(ma, mb)
  t1=sqrt(apply(ma, 1, crossprod))
  t2=sqrt(apply(mb, 1, crossprod))
  mat / outer(t1,t2)
}
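A short usage sketch for this function, with made-up matrices: passing the same matrix twice gives the square similarity matrix of its rows, while two different matrices with the same number of columns give a rectangular cross-similarity matrix. The function is repeated so that the example runs on its own:

```r
cos.sim <- function(ma, mb) {
  mat <- tcrossprod(ma, mb)              # dot products between rows of ma and mb
  t1 <- sqrt(apply(ma, 1, crossprod))    # row norms of ma
  t2 <- sqrt(apply(mb, 1, crossprod))    # row norms of mb
  mat / outer(t1, t2)
}

set.seed(42)
ma <- matrix(rnorm(12), nrow = 3, ncol = 4)   # 3 row vectors (illustrative)
mb <- matrix(rnorm(20), nrow = 5, ncol = 4)   # 5 row vectors (illustrative)

self  <- cos.sim(ma, ma)   # 3 x 3, ones on the diagonal
cross <- cos.sim(ma, mb)   # 3 x 5, rows of ma against rows of mb
dim(cross)
```

For clustering, as.dist(1 - self) turns the square result into a dissimilarity object.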
Kimsche
  • 61
1

Building on the earlier code from @Macro, we can wrap it into a cleaner, self-contained function that takes the data as an argument instead of relying on a global variable:

df <- data.frame(t(data.frame(c1=rnorm(100),
                              c2=rnorm(100),
                              c3=rnorm(100),
                              c4=rnorm(100),
                              c5=rnorm(100),
                              c6=rnorm(100))))

#df[df > 0] <- 1
#df[df <= 0] <- 0



apply_cosine_similarity <- function(df){
  cos.sim <- function(df, ix) 
  {
    A = df[ix[1],]
    B = df[ix[2],]
    return( sum(A*B)/sqrt(sum(A^2)*sum(B^2)) )
  }   
  n <- nrow(df) 
  cmb <- expand.grid(i=1:n, j=1:n) 
  C <- matrix(apply(cmb,1,function(cmb){ cos.sim(df, cmb) }),n,n)
  C
}
apply_cosine_similarity(df)

Hope this helps!

bmc
  • 121