What is a good similarity measure to use when missing data is a significant issue?

Question

I have a list of cities that I want to compare in terms of their similarity. Each city can be described by a large but finite number of characteristics, but most cities will have missing data for some random subset of those characteristics. If I consider each city to be an n-dimensional vector with missing data in some dimensions, I'm thinking of using cosine similarity to compare the cities, so that only dimensions with data in both cities are considered. Is cosine similarity the most appropriate similarity measure to use when there's missing data in random dimensions? Since some pairs of cities may get a very high similarity just by matching on perhaps two dimensions, while the remaining dimensions are ignored due to missing data, how can I show the relative "quality" of the results?
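To make the setup concrete, here is a minimal sketch of the computation described above. It is an illustration under stated assumptions, not an established method: missing values are assumed to be encoded as `None`, the function name `cosine_on_shared` is hypothetical, and returning the number of co-observed dimensions alongside the similarity is only one possible way to expose the "quality" of each result.

```python
import math

def cosine_on_shared(a, b):
    """Cosine similarity over the dimensions observed in BOTH vectors.

    Missing values are assumed to be encoded as None (an assumption of
    this sketch). Returns (similarity, n_shared), where n_shared is the
    number of dimensions that contributed to the similarity.
    """
    # Keep only the dimensions where both cities have data.
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    n_shared = len(pairs)

    dot = sum(x * y for x, y in pairs)
    norm_a = math.sqrt(sum(x * x for x, _ in pairs))
    norm_b = math.sqrt(sum(y * y for _, y in pairs))

    # The similarity is undefined with no overlap or a zero-length vector.
    if n_shared == 0 or norm_a == 0 or norm_b == 0:
        return None, n_shared
    return dot / (norm_a * norm_b), n_shared


# Two hypothetical cities with four characteristics each.
city_a = [1.0, 2.0, None, 4.0]
city_b = [2.0, 4.0, 5.0, None]
similarity, n_shared = cosine_on_shared(city_a, city_b)
print(similarity, n_shared)
```

In this example only the first two dimensions are observed in both cities, so the similarity rests on two values. Reporting `n_shared` next to each similarity makes that visible and allows pairs below some minimum overlap to be filtered out or flagged.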

"If I consider each city to be a n-dimensional vector with missing data turned to 0": The question is whether recoding missing values as a valid 0 is warranted. For binary (0 vs 1) data, 0 means "absent", while missing means "absent or present, not known". No similarity measure can itself help you decide how to handle missing data. — ttnphns, Jun 10 '14 at 07:16
Thanks for your comment, ttnphns - I agree that recoding missing data to 0 can be misleading. I've edited my question to clarify that I intend to use cosine similarity such that it will only consider dimensions where both cities have data. But I'm not sure how to show the relative "quality" or significance of the results. — KumaKuma, Jun 10 '14 at 17:38

0 Answers