191

I want to calculate the cosine similarity between two lists, for example list 1, dataSetI, and list 2, dataSetII.

Let's say dataSetI is [3, 45, 7, 2] and dataSetII is [2, 54, 13, 15]. The two lists always have the same length. I want to report the cosine similarity as a number between 0 and 1.

dataSetI = [3, 45, 7, 2]
dataSetII = [2, 54, 13, 15]

def cosine_similarity(list1, list2):
  # How to?
  pass

print(cosine_similarity(dataSetI, dataSetII))
Robin De Schepper
Rob Alsod
  • 48
    I love the way SO crushed the soul out of this homework question to make it a nice general reference one. OP says "**I cannot use *numpy***, I must go the pedestrian math way", and top answer goes "you should try scipy, it uses numpy". SO mechanics grant a gold badge to the popular question. – Nikana Reklawyks Sep 20 '16 at 03:46
  • 4
    Nikana Reklawyks, that is an excellent point. I've had that problem more and more often with StackOverflow. And I've had several questions marked as "duplicates" of some earlier question, because the moderators did not take the time to understand what made my question unique. – LRK9 Nov 10 '16 at 22:07
  • @NikanaReklawyks, this is great. Look at his profile, it tells the story of one of SO's top .01% contributors, you know? – Nathan Chappell Aug 04 '20 at 09:15
  • Well, I cleaned up the question. Now it's a general purpose question, it still shows no research effort but hey *shrugs* – Robin De Schepper Apr 03 '21 at 08:42

16 Answers

243

You should try SciPy. It has a bunch of useful scientific routines, for example "routines for computing integrals numerically, solving differential equations, optimization, and sparse matrices." It uses the fast, optimized NumPy for its number crunching. See the SciPy installation instructions to set it up.

Note that spatial.distance.cosine computes the distance, and not the similarity. So, you must subtract the value from 1 to get the similarity.

from scipy import spatial

dataSetI = [3, 45, 7, 2]
dataSetII = [2, 54, 13, 15]
result = 1 - spatial.distance.cosine(dataSetI, dataSetII)
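
To make the distance/similarity distinction concrete without installing anything, here is a minimal pure-Python sketch. The helper names `cosine_similarity` and `cosine_distance` are illustrative and are not SciPy's API; they only mirror the relationship distance = 1 - similarity described above.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length sequences."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def cosine_distance(u, v):
    """Distance in the sense used above: 1 minus the similarity."""
    return 1.0 - cosine_similarity(u, v)

# Orthogonal vectors: similarity 0.0, hence distance 1.0
print(cosine_distance([1, 0, 0], [0, 1, 0]))
```

So a distance of 1.0 for perpendicular vectors is expected, and the similarity is recovered by subtracting from 1.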
Riebeckite
charmoniumQ
  • Why does the example given in scipy.spatial.distance.cosine "distance.cosine([1, 0, 0], [0, 1, 0])" returns "1.0"? (I think this should be zero, no?) – Z.LI Sep 14 '21 at 11:56
  • 1
    @Z.LI No, since it is the distance and not the similarity, 1.0 is correct. Similarity is 1-d, which is obviously zero in this case. – Daniello Sep 15 '21 at 08:55
241

Another version, based on NumPy only:

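from numpy import dot
from numpy.linalg import norm

# a and b are equal-length 1-D sequences, e.g. the lists from the question
a = [3, 45, 7, 2]
b = [2, 54, 13, 15]
cos_sim = dot(a, b) / (norm(a) * norm(b))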
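
The question asks for a value between 0 and 1, while cosine similarity in general lies between -1 and 1. For vectors with only non-negative entries, as in the question, the result is already in [0, 1]. If negative entries can occur, one possible convention (an assumption here, not something this answer prescribes) is to rescale linearly; `to_unit_interval` is a hypothetical helper name, and the sketch uses plain Python so it runs without NumPy:

```python
import math

def cosine_similarity(a, b):
    """Plain-Python cosine similarity of two equal-length sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def to_unit_interval(cos_sim):
    """Map a value in [-1, 1] linearly onto [0, 1] (an assumed convention)."""
    return (cos_sim + 1.0) / 2.0

# Vectors pointing in opposite directions give a similarity close to -1
opposite = cosine_similarity([1, 2], [-1, -2])
print(opposite, to_unit_interval(opposite))
```

The rescaled number is no longer the cosine of the angle, so it should be labelled accordingly wherever it is reported.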
dontloo
  • 6
    Very clear as the definition, but maybe `np.inner(a, b) / (norm(a) * norm(b))` is better to understand. `dot` can get the same result as `inner` for vectors. – Belter Jul 03 '17 at 10:44
  • 30
    FYI this solution is significantly faster on my system than using `scipy.spatial.distance.cosine`. – Ozzah Apr 17 '19 at 23:39
  • 2
    @ZhengfangXin cosine similarity ranges from -1 to 1 by definition – dontloo Sep 17 '19 at 03:01
  • 6
    Even shorter: `cos_sim = (a @ b.T) / (norm(a)*norm(b))` – Union find Dec 03 '19 at 23:57
  • This is by far the fastest approach compared to others. – Jason Youn Apr 26 '20 at 08:51
  • 1
    As noted below, this is far more performant for smaller arrays, but the improvements tend to taper off as the arrays get bigger and bigger. – Nathan Chappell Aug 04 '20 at 08:54
  • @dontloo This gives similarity or distance? should we subtract it from 1 to get the similarity or does this give similarity by itself? – saichand Dec 28 '20 at 09:27
  • 1
    @saichand `cos_sim` is nothing but taking `normalized dot product` of two vectors. You can think it as taking dot product of the two unit vectors corresponding to each of these two vectors (representing `dot()` with a `.`: `(a.b)/(norm(a)*norm(b)) = (a/norm(a)).(b/norm(b)) = a_unit_vec . b_unit_vec`). As cos(angle) increases when the (smaller) angle between two vectors decreases, in other words, when two vectors become more and more similar, the `cos_sim = dot(a, b)/(norm(a)*norm(b))` will increase. So it is giving you the similarity. For distance, you have to subtract it from `1`. – hafiz031 May 24 '21 at 19:24
110

You can use the cosine_similarity function from sklearn.metrics.pairwise (see the scikit-learn docs):

In [23]: from sklearn.metrics.pairwise import cosine_similarity

In [24]: cosine_similarity([[1, 0, -1]], [[-1,-1, 0]])
Out[24]: array([[-0.5]])
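
The function takes 2-D inputs (one row per sample) and returns a matrix with one entry per pair of rows, which is why the result above is `array([[-0.5]])` rather than a bare number. The following plain-Python sketch mimics that shape convention; `pairwise_cosine` is a hypothetical name and this is not scikit-learn's implementation:

```python
import math

def pairwise_cosine(X, Y):
    """Return a nested list S where S[i][j] is the cosine similarity of X[i] and Y[j]."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))
    return [[cos(x, y) for y in Y] for x in X]

S = pairwise_cosine([[1, 0, -1]], [[-1, -1, 0]])
print(S)        # a 1 x 1 nested list, close to [[-0.5]]
print(S[0][0])  # index into it to get the single similarity value
```

With m rows in X and k rows in Y, the result has m rows and k columns, so a single pair of vectors is read out as `S[0][0]`.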
Akavall
  • 29
    Just a reminder that Passing one dimension arrays as input data is deprecated in sklearn version 0.17, and will raise ValueError in 0.19. – Chong Tang Mar 11 '16 at 14:36
  • 4
    What is the correct way to do this with sklearn given this deprecation warning? – Elliott Jul 07 '16 at 20:42
  • 4
    @Elliott one_dimension_array.reshape(-1,1) – bobo32 Dec 08 '16 at 16:45
  • 2
    @bobo32 cosine_similarity(np.array([1, 0, -1]).reshape(-1,0), np.array([-1, -1, 0]).reshape(-1,0)) I guess you mean? But what does that result mean that it returns? Its a new 2d array, not a cosine similarity. – Isbister Mar 02 '17 at 17:06
  • is the cosine metric of sklearn meaning "adjusted cosine similarity" or "cosine similarity"? – SarahData Nov 02 '17 at 09:58
  • 13
    Enclose it with one more bracket `cosine_similarity([[1, 0, -1]], [[-1,-1, 0]])` – Ayush Nov 11 '17 at 18:06
  • @Elliott the ValueError specifies the remedy: Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample. – Ender Mar 01 '21 at 20:17
44

I don't suppose performance matters much here, but I can't resist. The zip() function completely recopies both vectors (more of a matrix transpose, actually) just to get the data in "Pythonic" order. It would be interesting to time the nuts-and-bolts implementation:

import math
def cosine_similarity(v1,v2):
    "compute cosine similarity of v1 to v2: (v1 dot v2)/(||v1||*||v2||)"
    sumxx, sumxy, sumyy = 0, 0, 0
    for i in range(len(v1)):
        x = v1[i]; y = v2[i]
        sumxx += x*x
        sumyy += y*y
        sumxy += x*y
    return sumxy/math.sqrt(sumxx*sumyy)

v1,v2 = [3, 45, 7, 2], [2, 54, 13, 15]
print(v1, v2, cosine_similarity(v1,v2))

Output: [3, 45, 7, 2] [2, 54, 13, 15] 0.972284251712

That goes through the C-like noise of extracting elements one-at-a-time, but does no bulk array copying and gets everything important done in a single for loop, and uses a single square root.

ETA: Updated print call to be a function. (The original was Python 2.7, not 3.3. The current runs under Python 2.7 with a from __future__ import print_function statement.) The output is the same, either way.

CPython 2.7.3 on 3.0GHz Core 2 Duo:

>>> timeit.timeit("cosine_similarity(v1,v2)",setup="from __main__ import cosine_similarity, v1, v2")
2.4261788514654654
>>> timeit.timeit("cosine_measure(v1,v2)",setup="from __main__ import cosine_measure, v1, v2")
8.794677709375264

So, the unpythonic way is about 3.6 times faster in this case.
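
To reproduce a comparison like this on your own machine, a self-contained sketch could look as follows. Here `cosine_loop` restates the index-based loop above and `cosine_zip` is a zip-based variant written for this illustration (it is not necessarily the exact `cosine_measure` that was timed); absolute numbers will differ by interpreter and hardware.

```python
import math
import timeit

def cosine_loop(v1, v2):
    """Single pass over indices, one square root."""
    sumxx = sumxy = sumyy = 0
    for i in range(len(v1)):
        x, y = v1[i], v2[i]
        sumxx += x * x
        sumyy += y * y
        sumxy += x * y
    return sumxy / math.sqrt(sumxx * sumyy)

def cosine_zip(v1, v2):
    """Three zip-based passes, two square roots."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return dot(v1, v2) / (math.sqrt(dot(v1, v1)) * math.sqrt(dot(v2, v2)))

v1, v2 = [3, 45, 7, 2], [2, 54, 13, 15]
for func in (cosine_loop, cosine_zip):
    # timeit accepts a zero-argument callable; 100,000 calls keeps the run short
    seconds = timeit.timeit(lambda: func(v1, v2), number=100_000)
    print(f"{func.__name__}: {seconds:.3f} s for 100,000 calls")
```

Both functions return the same value up to rounding, so any difference in the printed times reflects the implementation style only.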

Anoyz
Mike Housky
  • 2
    What is `cosine_measure` in this case? – MERose Jan 30 '18 at 18:40
  • 1
    @MERose: `cosine_measure` and `cosine_similarity` are simply different implementations of the same calculation. Equivalent to scaling both input arrays to "unit vectors" and taking the dot product. – Mike Housky Mar 09 '18 at 03:20
  • 4
    I would have guessed the same. But it's not helpful. You present time comparisons of two algorithms but present only one of them. – MERose Mar 09 '18 at 12:29
  • @MERose Oh, sorry. `cosine_measure` is the code posted earlier by pkacprzak. This code was an alternative to the "other" all-standard-Python solution. – Mike Housky Mar 10 '18 at 04:41
  • thank you, this is great since it's not using any library and it's clear to understand the math behind it – grepit Nov 07 '18 at 07:19
26

Without using any imports:

math.sqrt(x)

can be replaced with

x ** .5

Without numpy.dot() you have to write your own dot function, here with a generator expression inside sum():

def dot(A,B): 
    return (sum(a*b for a,b in zip(A,B)))

and then it's just a simple matter of applying the cosine similarity formula:

def cosine_similarity(a,b):
    return dot(a,b) / ( (dot(a,a) **.5) * (dot(b,b) ** .5) )
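
One case the formula above does not cover is an all-zero vector, where the denominator is 0 and Python raises ZeroDivisionError. A possible guard is sketched below; returning 0.0 in that case is an assumed convention (cosine similarity is undefined there), and `safe_cosine_similarity` is a hypothetical name:

```python
def dot(A, B):
    return sum(a * b for a, b in zip(A, B))

def safe_cosine_similarity(a, b):
    """Cosine similarity that returns 0.0 instead of dividing by zero."""
    denominator = (dot(a, a) ** .5) * (dot(b, b) ** .5)
    if denominator == 0:
        return 0.0  # assumed fallback for zero vectors
    return dot(a, b) / denominator

print(safe_cosine_similarity([0, 0, 0], [1, 2, 3]))  # 0.0 via the fallback
print(safe_cosine_similarity([3, 45, 7, 2], [2, 54, 13, 15]))
```

Raising a ValueError instead of returning 0.0 is an equally reasonable choice if a zero vector signals bad input in your data.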
Mohammed
17

I ran a benchmark of several answers to this question, and the following snippet performed best:

import math
import operator

def dot_product2(v1, v2):
    return sum(map(operator.mul, v1, v2))


def vector_cos5(v1, v2):
    prod = dot_product2(v1, v2)
    len1 = math.sqrt(dot_product2(v1, v1))
    len2 = math.sqrt(dot_product2(v2, v2))
    return prod / (len1 * len2)

I was surprised that the SciPy-based implementation is not the fastest one. Profiling showed that cosine in SciPy spends a lot of time converting the vectors from Python lists to NumPy arrays.


mckelvin
  • how are you so sure that this is the fastest? – Jeru Luke Feb 17 '17 at 17:27
  • @JeruLuke I've pasted the link of my benchmark result at very beginning of the answer: https://gist.github.com/mckelvin/5bfad28ceb3a484dfd2a#file-cos_sim-py-L9-L25 – mckelvin Feb 19 '17 at 14:50
9
import math

def dot_product(v1, v2):
    # zip replaces itertools.izip, which exists only in Python 2
    return sum(map(lambda x: x[0] * x[1], zip(v1, v2)))

def cosine_measure(v1, v2):
    prod = dot_product(v1, v2)
    len1 = math.sqrt(dot_product(v1, v1))
    len2 = math.sqrt(dot_product(v2, v2))
    return prod / (len1 * len2)

You can round it after computing:

cosine = format(round(cosine_measure(v1, v2), 3))

If you want it really short, you can use this one-liner:

from functools import reduce
from math import sqrt

# Python 3: reduce lives in functools, and lambdas cannot unpack tuple arguments
def cosine_measure(v1, v2):
    return (lambda t: t[0] / sqrt(t[1] * t[2]))(reduce(lambda x, y: (x[0] + y[0] * y[1], x[1] + y[0]**2, x[2] + y[1]**2), zip(v1, v2), (0, 0, 0)))
pkacprzak
  • I tried this code out, and it doesn't seem to work. I tried it with v1 being `[2,3,2,5]`, and v2 being `[3,2,2,0]`. It returns with `1.0`, as if they were exactly the same. Any idea what is wrong? – Rob Alsod Aug 24 '13 at 23:53
  • The fix worked here. Nice job! See below for an uglier but faster approach. – Mike Housky Aug 25 '13 at 02:35
  • How is it possible to adapt this code if the similarity has to be calculated within a matrix and not for two vectors? I thought I take a matrix and the transposed matrix instead of the second vector, bit it doesn't seem to work. – student Aug 19 '16 at 11:59
  • you can use np.dot(x, y.T) to make it simpler – Areza Mar 04 '20 at 13:11
8

You can use this simple function to calculate the cosine similarity:

import math

def cosine_similarity(a, b):
  return sum(i*j for i,j in zip(a, b)) / (math.sqrt(sum(i*i for i in a)) * math.sqrt(sum(i*i for i in b)))
Robin De Schepper
Isira
8

Python code to calculate:

  • Cosine Distance
  • Cosine Similarity
  • Angular Distance
  • Angular Similarity

import math

from scipy import spatial


def calculate_cosine_distance(a, b):
    cosine_distance = float(spatial.distance.cosine(a, b))
    return cosine_distance


def calculate_cosine_similarity(a, b):
    cosine_similarity = 1 - calculate_cosine_distance(a, b)
    return cosine_similarity


def calculate_angular_distance(a, b):
    cosine_similarity = calculate_cosine_similarity(a, b)
    # clamp: rounding can push the value just outside acos's [-1, 1] domain
    angular_distance = math.acos(max(-1.0, min(1.0, cosine_similarity))) / math.pi
    return angular_distance


def calculate_angular_similarity(a, b):
    angular_similarity = 1 - calculate_angular_distance(a, b)
    return angular_similarity
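
For readers without SciPy, the same four quantities can be sketched with the standard library alone. The clamp to [-1, 1] guards against floating-point rounding pushing the cosine marginally outside the domain of `math.acos`; the function name `cosine_and_angular` is illustrative and not part of any library:

```python
import math

def cosine_and_angular(a, b):
    """Return (cosine_sim, cosine_dist, angular_dist, angular_sim) for two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    cosine_sim = max(-1.0, min(1.0, dot / norms))   # clamp against rounding error
    angular_dist = math.acos(cosine_sim) / math.pi  # 0 = same direction, 1 = opposite
    return cosine_sim, 1 - cosine_sim, angular_dist, 1 - angular_dist

# Orthogonal vectors: similarity 0, angular distance one half
sim, dist, ang_dist, ang_sim = cosine_and_angular([1, 0], [0, 1])
print(sim, dist, ang_dist, ang_sim)
```

Dividing the angle by pi normalizes the angular distance to [0, 1], which matches the convention used in the functions above.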
Amir Saniyan
3

You can do this in Python with a simple function that takes two mappings (for example token -> count, or index -> value):

import math

def get_cosine(vec1, vec2):
  # only keys present in both mappings contribute to the dot product
  intersection = set(vec1.keys()) & set(vec2.keys())
  numerator = sum(vec1[x] * vec2[x] for x in intersection)
  sum1 = sum(v**2 for v in vec1.values())
  sum2 = sum(v**2 for v in vec2.values())
  denominator = math.sqrt(sum1) * math.sqrt(sum2)
  if not denominator:
    return 0.0
  return round(float(numerator) / denominator, 3)

# the function expects mappings, so index the lists first
dataSet1 = dict(enumerate([3, 45, 7, 2]))
dataSet2 = dict(enumerate([2, 54, 13, 15]))
print(get_cosine(dataSet1, dataSet2))
  • 3
    This is a text implementation of cosine. It will give the wrong output for numerical input. – alvas Jan 12 '16 at 10:17
  • Can you explain why you used set in the line "intersection = set(vec1.keys()) & set(vec2.keys())". – Ghos3t Apr 12 '19 at 00:17
  • Also your function seems to be expecting maps but you are sending it lists of integers. – Ghos3t Apr 12 '19 at 00:24
3

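Using NumPy to compare one list of numbers against multiple lists (a matrix). Both arguments must be NumPy arrays, and the trailing [::-1] returns the scores in reverse row order:

import numpy as np

def cosine_similarity(vector,matrix):
   return ( np.sum(vector*matrix,axis=1) / ( np.sqrt(np.sum(matrix**2,axis=1)) * np.sqrt(np.sum(vector**2)) ) )[::-1]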
sten
2

If you happen to be using PyTorch already, you should go with their CosineSimilarity implementation.

Suppose you have two 1-D numpy.ndarrays of length n, v1 and v2, i.e. their shapes are both (n,). Here's how you get their cosine similarity:

import torch
import torch.nn as nn

cos = nn.CosineSimilarity()
cos(torch.tensor([v1]), torch.tensor([v2])).item()

Or suppose you have two numpy.ndarrays w1 and w2, whose shapes are both (m, n). The following gets you a list of cosine similarities, each being the cosine similarity between a row in w1 and the corresponding row in w2:

cos(torch.tensor(w1), torch.tensor(w2)).tolist()
Ethan Chen
  • 1
    I suggest using the functional implementation of the cosine similarity directly (torch.nn.functional.cosine_similarity), instead of instantiating the module implementation and applying the instance of your tensor. – eavsteen Mar 04 '21 at 21:23
1

Another version: if you have a list of vectors and a query vector, you can compute the cosine similarity of the query with every vector in the list in one go:

>>> import numpy as np

>>> A      # list of vectors, shape -> m x n
array([[ 3, 45,  7,  2],
       [ 1, 23,  3,  4]])

>>> B      # query vector, shape -> (n,)
array([ 2, 54, 13, 15])

>>> similarity_scores = A.dot(B)/ (np.linalg.norm(A, axis=1) * np.linalg.norm(B))

>>> similarity_scores
array([0.97228425, 0.99026919])
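
A common follow-up is to rank the stored vectors by their score against the query. A dependency-free sketch of that step, using the same data as above (`rank_by_cosine` is a hypothetical helper, not a NumPy function):

```python
import math

def rank_by_cosine(vectors, query):
    """Return (index, score) pairs sorted from most to least similar to query."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))
    scores = [(i, cos(vec, query)) for i, vec in enumerate(vectors)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

A = [[3, 45, 7, 2], [1, 23, 3, 4]]
B = [2, 54, 13, 15]
for index, score in rank_by_cosine(A, B):
    print(index, round(score, 8))
```

With the scores shown above, the second row (index 1) comes out ahead of the first.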
Rachit Tayal
0

We can calculate cosine similarity with simple arithmetic: cosine_similarity = (dot product of the vectors) / (product of the norms of the vectors). Subtracting that value from 1 gives the cosine distance instead. We can define one function for the dot product and one for the norm.

def dprod(a, b):
    total = 0
    for i in range(len(a)):
        total += a[i] * b[i]
    return total

def norm(a):
    total = 0
    for i in range(len(a)):
        total += a[i] ** 2
    return total ** 0.5

a, b = [3, 45, 7, 2], [2, 54, 13, 15]
cosine_a_b = dprod(a, b) / (norm(a) * norm(b))
0

Here is an implementation that works for matrices as well. For 2-D inputs with one vector per row, it behaves like sklearn's cosine_similarity:

import numpy as np

def cosine_similarity(a, b):
    return np.divide(np.dot(a, b.T), np.linalg.norm(a, axis=1, keepdims=True) @ np.linalg.norm(b, axis=1, keepdims=True).T)
-3

Most of the answers here also work when you cannot use NumPy. If you can, here is another approach for 1-D arrays, with a small EPSILON added to the denominator to guard against division by zero:

import numpy as np

EPSILON = 1e-07  # keeps the division safe when a vector is all zeros

def cosine(x, y):
    dot_products = np.dot(x, y.T)
    norm_products = np.linalg.norm(x) * np.linalg.norm(y)
    return dot_products / (norm_products + EPSILON)

Cody Gray
Areza