4

I have records (rows) in a database and I want to find similar records. I am constrained to use cosine similarity. If the variables (attributes, columns) vary in type and come in this form:

[number] [number] [boolean] [20 chars string]

how can I vectorize the records in order to apply cosine similarity? For the string I can use simple tf-idf, but what about the numbers and the boolean values? And how can these be combined?

curious
  • Cosine for boolean (binary) data is sometimes called the Ochiai coefficient and has its own formula, but the general cosine formula is of course valid too. So the only remaining ambiguity is with that string variable. tf-idf (as far as I know) isn't cosine. Well, which records are similar and which are dissimilar by that string, for you? Describe it, perhaps with examples. – ttnphns Mar 19 '13 at 17:34
  • I mean how to proceed with the vectorization. Of course tf-idf isn't cosine. It's a way to vectorize a text. Can you be precise about how to vectorize booleans and numbers in order to construct the vectors and feed them to the cosine? – curious Mar 19 '13 at 17:37
  • What do you mean by vectorization? If it's unfolding a matrix into a vector then I don't see why you need it. If you have any data matrix records X numeric_attributes then you will be able to obtain a square symmetric matrix of cosine similarity between the records. – ttnphns Mar 19 '13 at 17:44
  • Vectorization is the first step of cosine similarity. Suppose I have two records, r1=234,1023,No,Today is Sunday. and r2=876,423,Yes,Tomorrow i am leaving. How can I compute the cosine of those 2 records? How can I compute their vectors? Do I just take the ASCII representation char by char and make a vector? Then there are no semantics and the cosine might give inaccurate results. – curious Mar 19 '13 at 17:52
  • If in your comment example one omits the last, string attribute (because cosine cannot be computed unless all the data are numeric) then the raw cosine (cosine computed on data not standardized in any way) between r1 and r2 is .62468. – ttnphns Mar 19 '13 at 18:06
  • I think you don't understand. Say I give you two files, text files, and I want to compute the cosine. What do I do? Do you not compute it because these are not vectors? There is a process to make them vectors. For texts it is known: you can take the term frequency of each word. But what do you do with numbers? Also, I didn't ask you to write me the result. Are you serious? This is an educational site. Explain to me how you proceed. – curious Mar 19 '13 at 18:14
  • Also, look at my question. It seems like you don't understand what I am asking. – curious Mar 19 '13 at 18:20
  • @curious are you thinking of computing cosine similarity via something like locality sensitive hashing (LSH)? If that's the case, you map all the documents (or whatever your data are) onto a common space via hashing (usually using hash functions that hash to ints). Is this what you're asking about? – Lucas Roberts Nov 03 '21 at 01:37
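
For reference, the raw cosine quoted in the comments can be reproduced with a few lines of Python. This is only a sketch: it assumes the boolean is encoded as No = 0 and Yes = 1 (the thread does not state the encoding) and it drops the string attribute, as the comment does.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# r1 = 234, 1023, No   and   r2 = 876, 423, Yes   (string attribute omitted)
# Assumed encoding: No -> 0, Yes -> 1
r1 = [234, 1023, 0]
r2 = [876, 423, 1]

raw_cosine = cosine(r1, r2)
print(round(raw_cosine, 5))  # 0.62468
```

Because the two numeric columns are on a much larger scale than the 0/1 flag, the boolean has almost no influence on this raw value.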

1 Answer

1

You can normalize each field, for example by dividing it by its mean value. You can also weight the normalized fields according to their importance, based on domain knowledge. After this you can apply the standard cosine similarity formula.
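
One way this could look in practice is sketched below in plain Python. It is not a definitive recipe, and several things in it are assumptions made for illustration: the third record, the `weights` values, the helper names (`cosine`, `tfidf_vectors`, `vectorize`), the 0/1 encoding of the boolean, and the particular tf-idf variant (raw term count times `log(N / df) + 1`, L2-normalized per string). Each numeric column is divided by its mean, the string is turned into a tf-idf vector, every block is multiplied by its weight, and the blocks are concatenated into one vector per record.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def tfidf_vectors(texts):
    """One simple tf-idf variant, L2-normalized so each string has unit length."""
    docs = [Counter(t.lower().split()) for t in texts]
    vocab = sorted(set().union(*docs))
    n = len(docs)
    idf = {w: math.log(n / sum(1 for d in docs if w in d)) + 1.0 for w in vocab}
    vectors = []
    for d in docs:
        vec = [d[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors

def vectorize(records, weights):
    """Turn (number, number, boolean, string) records into numeric vectors."""
    n = len(records)
    mean1 = sum(r[0] for r in records) / n  # assumed to be non-zero
    mean2 = sum(r[1] for r in records) / n  # assumed to be non-zero
    text_vecs = tfidf_vectors([r[3] for r in records])
    out = []
    for r, tv in zip(records, text_vecs):
        numeric = [
            weights["num1"] * r[0] / mean1,   # divide by the column mean
            weights["num2"] * r[1] / mean2,
            weights["flag"] * float(r[2]),    # boolean encoded as 0.0 / 1.0
        ]
        out.append(numeric + [weights["text"] * x for x in tv])
    return out

records = [
    (234, 1023, False, "Today is Sunday"),
    (876, 423, True, "Tomorrow I am leaving"),
    (250, 1000, False, "Today is Monday"),  # hypothetical third record
]
# Hypothetical weights; in practice these would come from domain knowledge.
weights = {"num1": 1.0, "num2": 1.0, "flag": 1.0, "text": 1.0}

vectors = vectorize(records, weights)
sim_01 = cosine(vectors[0], vectors[1])
sim_02 = cosine(vectors[0], vectors[2])
print(sim_01, sim_02)
```

With these made-up inputs the first and third records, which have close numbers, the same flag and overlapping words, come out more similar than the first and second. Changing the weights shifts how much each field contributes to that result.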