I have corpora of classified text. From these I create vectors, one per document. The components of each vector are the weights of the words in that document, computed as TF-IDF values. Next I build a model in which every class is represented by a single vector, so the model has as many vectors as there are classes in the corpora. Each component of a model vector is the mean of the corresponding component values over all document vectors in that class. For an unclassified vector, I determine its similarity to a model vector by computing the cosine between the two vectors.
Question: Can I use the Euclidean distance between an unclassified vector and a model vector to compute their similarity? If not, why?
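The setup described above can be sketched roughly as follows. This is only a minimal illustration: the toy corpus, the whitespace tokenisation, the particular TF-IDF variant (relative term frequency times `log(N / df)`), and all function names are assumptions made for the example, not details of my actual pipeline.

```python
import math
from collections import Counter, defaultdict

def idf_table(docs):
    """Inverse document frequency for every word seen in the training docs."""
    n = len(docs)
    df = Counter(word for doc in docs for word in set(doc))
    return {word: math.log(n / count) for word, count in df.items()}

def tfidf(doc, idf):
    """Sparse TF-IDF vector (a dict) for one tokenised document.
    Words never seen in training are dropped (an assumption of this sketch)."""
    tf = Counter(doc)
    return {w: (c / len(doc)) * idf[w] for w, c in tf.items() if w in idf}

def centroid(vectors):
    """Component-wise mean of sparse vectors; a missing component counts as 0."""
    total = defaultdict(float)
    for vec in vectors:
        for w, x in vec.items():
            total[w] += x
    return {w: x / len(vectors) for w, x in total.items()}

def cosine(a, b):
    """Cosine of the angle between two sparse vectors (0.0 if either is all zeros)."""
    dot = sum(x * b.get(w, 0.0) for w, x in a.items())
    norm_a = math.sqrt(sum(x * x for x in a.values()))
    norm_b = math.sqrt(sum(x * x for x in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical toy corpus: (class label, tokens).
train = [
    ("sports", "the team won the match".split()),
    ("sports", "the player scored a goal".split()),
    ("tech", "the new phone has a fast chip".split()),
    ("tech", "the laptop chip runs fast code".split()),
]

idf = idf_table([tokens for _, tokens in train])

# One model vector per class: the mean of that class's document vectors.
by_class = defaultdict(list)
for label, tokens in train:
    by_class[label].append(tfidf(tokens, idf))
model = {label: centroid(vecs) for label, vecs in by_class.items()}

# Score an unclassified document against every model vector.
query = tfidf("the team scored a late goal".split(), idf)
scores = {label: cosine(query, vec) for label, vec in model.items()}
predicted = max(scores, key=scores.get)
print(predicted, scores)
```

Here the unclassified document is assigned to the class whose model vector has the largest cosine with it.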
Thanks!
$dist = 1 - sim$, $dist = \frac{1-sim}{sim}$, $dist = \sqrt{1-sim}$ or $dist = -\log(sim)$. However, it is important to remember that, in general, a distance is not a similarity. The latter is subjective (are two objects $X$ and $Y$ similar only if $sim(X,Y) \geq 0.85193$?). A distance, in contrast, is a true metric that satisfies specific, well-founded properties...
– NeuroMorphing Mar 08 '17 at 00:30
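The conversions listed in the comment above can be tried out with a short sketch like the one below. The function names are made up for the illustration, and it assumes $0 < sim \leq 1$ so that the division and the logarithm are defined. The last part checks a standard identity that is not stated in the comment: for vectors scaled to unit length, the Euclidean distance equals $\sqrt{2(1 - sim)}$, so on normalised vectors it ranks candidates in the same order as cosine similarity.

```python
import math

def sim_to_dist(sim):
    """The four similarity-to-distance conversions; assumes 0 < sim <= 1."""
    return {
        "1 - sim": 1 - sim,
        "(1 - sim) / sim": (1 - sim) / sim,
        "sqrt(1 - sim)": math.sqrt(1 - sim),
        "-log(sim)": -math.log(sim),
    }

def normalise(v):
    """Scale a dense vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Every conversion shrinks as the similarity grows and is 0 at sim = 1.
for s in (0.5, 0.9, 1.0):
    print(s, sim_to_dist(s))

# Arbitrary example vectors, normalised to unit length.
a = normalise([1.0, 2.0, 3.0])
b = normalise([2.0, 0.0, 1.0])
print(euclidean(a, b), math.sqrt(2 * (1 - cosine(a, b))))
```

On unnormalised vectors the two measures can disagree, because Euclidean distance also depends on vector length while cosine depends only on direction.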