Vector space model: cosine similarity vs Euclidean distance

Question

I have corpora of classified text. From these I create vectors, one per document. The vector components are the TF-IDF weights of the words in that document. Next I build a model in which every class is represented by a single vector, so the model has as many vectors as there are classes in the corpora. Each component of a model vector is the mean of the corresponding components of all document vectors in that class. For an unclassified vector, I determine its similarity to a model vector by computing the cosine between the two vectors.

Question: Can I use the Euclidean distance between an unclassified vector and a model vector to compute their similarity? If not, why?
Thanks!
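
For concreteness, the setup described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the TF-IDF vectors are taken as given (the numbers and class labels below are made up), all vectors are non-zero, and the helper names `mean_vector`, `cosine` and `classify` are hypothetical.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of equally sized vectors (the class model vector)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def classify(vector, model):
    """Return the class whose model vector has the highest cosine with `vector`."""
    return max(model, key=lambda label: cosine(vector, model[label]))

# Made-up TF-IDF vectors over a 3-word vocabulary, grouped by class.
training = {
    "sports": [[0.9, 0.1, 0.0], [0.7, 0.3, 0.0]],
    "politics": [[0.0, 0.2, 0.8], [0.1, 0.0, 0.9]],
}

# One model vector per class: the mean of that class's document vectors.
model = {label: mean_vector(vectors) for label, vectors in training.items()}

# Assign an unclassified vector to the most similar class.
predicted = classify([0.5, 0.1, 0.1], model)
```

The question is then whether `cosine` inside `classify` could be swapped for a Euclidean distance (taking the minimum instead of the maximum).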

Thanks! Still it is not clear to me why Euclidean distance can not be used as similarity measure instead of cosine of angle between two vectors and vice versa? — Anton Ashanin, Oct 16 '13 at 19:08
You can use the Euclidean distance, as long as you use an appropriate transformation rule, e.g.:
$dist = 1 -sim$, $dist = \frac{1-sim}{sim}$, $dist = \sqrt{1-sim}$ or $dist = -\log(sim)$. However, it is important to remember that in general a distance is not a similarity. The latter one is subjective-driven (two objects $X$ and $Y$ are similar if their $sim(X,Y) \geq 0.85193$ ?). A distance, in contrast, is a real metric that follows (specific) well-founded properties... — NeuroMorphing, Mar 08 '17 at 00:30

score 4 · Answer 1 · answered Aug 28 '17 at 07:43

To complement other answers:

Cosine similarity of $x, y$ : $\frac{\langle x, y\rangle}{\|x\|\|y\|}$

Euclidean distance (squared) between $x, y$: $\|x-y\|^2 = \|x\|^2 +\|y\|^2 - 2\langle x , y\rangle$

Assuming that $x, y$ are normalized to unit length ($\|x\| = \|y\| = 1$), these simplify to:

Cosine similarity: $\langle x , y\rangle$

Euclidean distance (squared): $2(1 - \langle x , y\rangle)$

As you can see, minimizing the (squared) Euclidean distance is equivalent to maximizing the cosine similarity if the vectors are normalized to unit length.
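
The identity above can be checked numerically. The following is a small sketch using only the standard library; the vectors are arbitrary examples and the helper names are hypothetical. It verifies that $\|x-y\|^2 = 2(1 - \langle x, y\rangle)$ for unit vectors and that ranking candidates by cosine (descending) gives the same order as ranking by squared Euclidean distance (ascending).

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def normalize(x):
    """Scale a non-zero vector to unit Euclidean length."""
    norm = math.sqrt(dot(x, x))
    return [a / norm for a in x]

def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

# Two arbitrary vectors, normalized to unit length.
x = normalize([3.0, 4.0, 0.0])
y = normalize([1.0, 2.0, 2.0])

cos_sim = dot(x, y)                  # cosine similarity of unit vectors
sq_dist = squared_euclidean(x, y)    # squared Euclidean distance
identity_gap = abs(sq_dist - 2 * (1 - cos_sim))  # zero up to rounding error

# Rank some normalized candidates against the query x in both ways.
candidates = [normalize(c) for c in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0])]
by_cosine = sorted(range(len(candidates)), key=lambda i: -dot(x, candidates[i]))
by_distance = sorted(range(len(candidates)), key=lambda i: squared_euclidean(x, candidates[i]))
```

With unit-length vectors both orderings coincide, so a nearest-model-vector classifier makes the same decision with either measure.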

score 2 · Answer 2 · answered Mar 08 '17 at 00:35

You can use the Euclidean distance, as long as you use an appropriate transformation rule, e.g.:

$dist = 1 -sim$, $dist = \frac{1-sim}{sim}$, $dist = \sqrt{1-sim}$ or $dist = -\log(sim)$.

However, it is important to remember that in general a distance is not a similarity. A similarity is subjective (are two objects $X$ and $Y$ similar if their calculated similarity score $sim(X,Y)$ exceeds 0.85193?). A distance, in contrast, is a true metric that satisfies a number of well-founded properties. Have a look at the "Encyclopedia of Distances".
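
To make the transformation rules above concrete, here is a minimal Python sketch. The names `TRANSFORMS` and `to_distances` are hypothetical, and the sketch assumes the similarity lies in $(0, 1]$ so that the division and the logarithm are defined.

```python
import math

# One entry per transformation rule listed above; each assumes 0 < sim <= 1.
TRANSFORMS = {
    "1 - sim": lambda sim: 1 - sim,
    "(1 - sim) / sim": lambda sim: (1 - sim) / sim,
    "sqrt(1 - sim)": lambda sim: math.sqrt(1 - sim),
    "-log(sim)": lambda sim: -math.log(sim),
}

def to_distances(sim):
    """Apply every transformation rule to a similarity in (0, 1]."""
    if not 0 < sim <= 1:
        raise ValueError("similarity must lie in (0, 1]")
    return {name: rule(sim) for name, rule in TRANSFORMS.items()}

identical = to_distances(1.0)  # every rule maps sim = 1 to distance 0
halfway = to_distances(0.5)
```

All four rules are decreasing in the similarity, so they preserve the ordering of candidates. Whether the result also satisfies metric properties such as the triangle inequality depends on the similarity and on the rule chosen.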

score 2 · Answer 3 · answered Mar 08 '17 at 00:47

If you don't normalize the vectors to unit length, their norms will depend on the length of the document. In document classification we usually don't want to be biased by document length. This is one reason why cosine similarity is preferred.
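
A small sketch of this length effect, using made-up vectors and assuming for illustration that the weights are proportional to raw term counts, so that concatenating a document with itself simply doubles its vector:

```python
import math

def cosine(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

doc = [2.0, 1.0, 0.0]              # a short document
doubled = [2 * w for w in doc]     # same content repeated twice (assumed to double the weights)
other = [2.0, 0.0, 1.0]            # a different short document

cos_same_content = cosine(doc, doubled)     # 1 up to rounding: same direction
cos_other = cosine(doc, other)              # lower: different word distribution
dist_same_content = euclidean(doc, doubled)
dist_other = euclidean(doc, other)
```

Without normalization, the Euclidean distance here places the doubled document farther from the original than the document with different content, while the cosine treats the original and its doubled copy as identical.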
