I have a list of cities that I want to compare in terms of their similarity. Each city can be described by a large but finite number of characteristics, but most cities will have missing data for some random subset of those characteristics. If I consider each city to be an n-dimensional vector with missing data in some dimensions, I'm thinking of using cosine similarity to compare the cities, so that only dimensions with data in both cities are considered. Is cosine similarity the most appropriate similarity measure to use when there's missing data in random dimensions? Some pairs of cities may get a very high similarity just by matching on perhaps two dimensions, while their other dimensions are ignored due to missing data. How can I show the relative "quality" of the results?
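A minimal sketch of the comparison described above, assuming missing values are encoded as `None`. The function name `masked_cosine` and the sample vectors are hypothetical, made up for illustration. Returning the number of shared dimensions alongside the score is just one possible way to expose the "quality" of each result, not an established method.

```python
import math

def masked_cosine(a, b):
    """Cosine similarity over the dimensions observed in BOTH vectors.

    Returns (similarity, n_shared). similarity is None when it is
    undefined (no shared dimensions, or a zero vector on the shared ones).
    """
    # Keep only coordinate pairs where neither value is missing.
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    n_shared = len(pairs)

    dot = sum(x * y for x, y in pairs)
    norm_a = math.sqrt(sum(x * x for x, _ in pairs))
    norm_b = math.sqrt(sum(y * y for _, y in pairs))

    if n_shared == 0 or norm_a == 0.0 or norm_b == 0.0:
        return None, n_shared
    return dot / (norm_a * norm_b), n_shared

# Hypothetical cities: only the first two dimensions are observed in both.
city_a = [1.0, 2.0, None, 4.0]
city_b = [2.0, 4.0, 5.0, None]

sim, n_shared = masked_cosine(city_a, city_b)
print(sim, n_shared)
```

Here the two cities come out as almost perfectly similar even though only 2 of the 4 dimensions were compared, which is exactly the concern: the score alone says nothing about how much data supports it, so the overlap count needs to be reported with it.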
"If I consider each city to be an n-dimensional vector with missing data turned to 0": The question is whether recoding missing values as a valid 0 is warranted. For binary (0 vs 1) data, 0 means "absent", while missing means "absent or present, not known". No similarity measure can by itself help you decide how to handle missing data. – ttnphns Jun 10 '14 at 07:16